Non Linear Frequency Warping In Speaker 

Normalization 


by 

Vinay M. K. 





DEPARTMENT OF ELECTRICAL ENGINEERING 

INDIAN INSTITUTE OF TECHNOLOGY, KANPUR 

February, 2001 



Non Linear Frequency Warping In Speaker 

Normalization 


A Thesis Submitted 

in Partial Fulfilment of the Requirements
for the Degree of 

Master of Technology 


by 

Vinay M. K. 



Department of Electrical Engineering 

Indian Institute of Technology, Kanpur 


February, 2001 





Certificate 


This is to certify that the work contained in the thesis entitled “Non Linear
Frequency Warping In Speaker Normalization” by Vinay M. K., has been carried
out under my supervision and that this work has not been submitted elsewhere for a 
degree. 


February, 2001 



(Dr. S. Umesh) 

Department of Electrical Engineering, 
Indian Institute of Technology, 
Kanpur. 



Abstract 


In an effort to utilize the additional information available from the classical 
speech analysis studies regarding the nature of spectral scaling among speakers, a 
non-linear scaling function is proposed, for speaker normalization. The proposed 
non-linear scaling function is independent of the phoneme class and is completely 
derived from vowel formant database. This non-linear scaling function has been 
analysed using various methods like formant data analysis, spectral alignments and 
HMM-based recognizers. Using separability measures like F-ratio and residual vari- 
ance, the proposed method is found to be superior to linear scaling for formant 
data analysis. A warping function and hence a non-linear scale invariant transfor- 
mation is derived from this non-linear scaling function, to be incorporated into a 
HMM-based recognizer. Using recognition accuracy as a performance measure the 
proposed transformation is compared with other similar schemes. Our results sug- 
gest that further studies are necessary to enable effective use of non-linear scaling 
functions on HMM-based recognizers. 



Acknowledgements 


At the outset I would like to thank the Almighty for being what he is, benevolent. 

I would like to express my gratitude to my thesis supervisor Dr. S. Umesh for his 
constant encouragement and valuable guidance throughout the course of my thesis 
work. I thank him for his concern towards my well being and for showering on me 
this treasure of knowledge from both in and out of academic field. 

I am grateful to all my teachers at IIT, Kanpur, for their valuable guidance. I 
would like to thank my close friend, critic, guide and lab mate Rohit for all the help, 
which I cannot put in words. I would like to thank Shafi for his valuable tips and 
warm friendship. I thank Shantakka, Putty and Dr Raghavendra for providing me 
with a “home” at IIT Kanpur. I thank all the Kannada Sangha members and friends 
(Keshava, Vivek, Phani and Anand) for the joyous moments we had together. 

Thanks to all my DSP classmates (Rajiv, Kaushal and Prabhat) and my beloved
Sanketh (Teju, Ravi, Asheesh and Keshava) for making my stay at IIT Kanpur a
memorable one. Finally, thanks to my sweethearts appa, amma and vicky, without
whom I wouldn’t have been what I am.


Contents 


1 Introduction 1 

2 Non-Uniform Vowel Normalization 4 

2.1 Simple Linear Scaling 4 

2.2 Non-Linear Scaling 6 

2.3 Proposed Non-Linear Scaling 9 

2.3.1 Weighting Factor as a Function of Frequency 9 

2.3.2 Experiments and Results 12 

2.4 Summary 16 

3 A Warping Function for Non-uniform Vowel Normalization 18 

3.1 Non-linear Scaling of Continuous Vowel Spectra 18 

3.1.1 Vocal Tract Length Estimation 19 

3.1.2 Experiments and Results 19 

3.2 Approximate Scaling Function 22 

3.3 Warping-Function 22 

3.4 Summary 28 

4 Recognizers With Scale Factor Estimation 29 

4.1 Hidden Markov Model Based Speech Recognizer 30 

4.1.1 Feature Extraction 30 

4.1.2 Pattern Recognition 33 

4.2 Non-uniform Normalization on the Recognizer 34 

4.2.1 Scaling Factor Estimation in ML Sense 34 

4.2.2 Training and Testing Procedure 35 





4.2.3 Non-Linear Scaling With Filterbank Analysis 36 

4.2.4 Experiments and Results 39 

4.3 Summary 41 

5 Recognizers With Non-Linear Scale Invariance 42 

5.1 Non-linear Scale Invariance on a Recognizer 42 

5.1.1 Non-Linear Scale Invariant Transformation 43 

5.1.2 Experiments and Results 46 

5.2 Summary 48 

6 Conclusions 49 





List of Figures 


2.1 Deviations from linear scaling 5

2.2 Weighting factors for females and children with reference male 7

2.3 Weighting factors for children with reference male and female 8

2.4 Weighting factor distribution for /IY/ and /UH/ categories 11

2.5 Proposed non-linear scaling function 12

2.6 Effect of normalization on female cluster 14

2.7 Scatter plots of F1 - F2 for 10 Vowels 17

3.1 Scatter plot of estimated VTL 20

3.2 Various scaling functions as applied to the front vowel /IH/ 21

3.3 Various scaling functions as applied to the open vowel /AE/ 22

3.4 Various scaling functions as applied to the rounded vowel /UH/ 23

3.5 Approximate scaling function 24

3.6 Approximate scaling function as applied to the front vowel /IH/ 25

3.7 0(f) for non-linear scaling schemes 26

3.8 Warping functions for various non-linear scaling schemes 27

4.1 Block diagram of a continuous speech recognizer 30

4.2 Feature Extraction stages 32

4.3 Recognition with speaker normalization 37

4.4 Mel filterbank analysis with frequency warping 38

5.1 Log warping of two scaled signals 44





List of Tables 


2.1 Formant and vowel specific scale factors, K_nfm 10

2.2 Cluster discriminability in terms of F-Ratio 15

2.3 Residual variance after normalization 16

3.1 β values as used by Umesh et al. 28

4.1 Mel filter boundaries and γ(f) 38

4.2 Recognition performance 40

5.1 Eight band non-linear frequency warping implementation 46

5.2 Fifteen band non-linear frequency warping implementation 46

5.3 Recognition performance of non-linear scale-invariant transformations 48





Chapter 1 
Introduction 


An automatic speech recognition (ASR) system enables a computer (or a machine) to
recognize the words spoken by a person. ASR is essentially the process of converting
speech to text. An ideal ASR system should recognize, with 100% accuracy, all the
words (as defined by the dictionary) spoken by any person, independent of vocabulary
size, speaker characteristics, accent, noise and channel characteristics. Present-day
ASR systems are quite far from this ideal. There are broadly two classes of ASR
systems: (1) speaker dependent and (2) speaker independent. This distinction is based
purely on the kind of training dataset used. Speaker dependent (SD) systems are
trained on speech data collected from a single user, who is the sole user of the
system. Speaker independent (SI) systems, on the other hand, are trained on speech
collected from many different users. Typical applications of SD systems are desktop
applications such as word processing, while SI systems are typically used at public
interfaces such as airline interface systems and telephone directory services, where
the speakers are varied. The parameters of the acoustic models of an SI system are
estimated from speech collected from a large population of speakers, in order to
model the large speaker variability in the user population. While SI systems yield
better recognition rates than SD systems for test speakers who are not in the
training dataset, they are less accurate (2-3 times higher word error rates) than
adequately trained SD systems for a given speaker who has contributed to the training
dataset. This degradation in the performance of SI systems relative to SD systems for
a given speaker is mainly due to the large speaker variability present in the
training set of SI systems. Speaker normalization techniques aim to remove these
speaker-specific variabilities from SI systems.

The superior performance of SD systems over SI systems for a given task and
speaker makes it clear that speaker variabilities play a major role in recognition
performance. Speaker variabilities can be due either to non-physiological factors
such as dialect, emotion, speaking idiosyncrasies and background noise, or to
intrinsic factors such as vocal tract physiology and pitch, which usually depend on
the age and the gender of the speaker.

There are two broad approaches to being robust to some of these variations.
(1) Model-based approaches, such as maximum likelihood linear regression (MLLR)
[1] and speaker adaptive training (SAT) [2], derive a transformation of the model
parameters to arrive at a compact model free of speaker and other variabilities.
(2) Feature-based approaches, such as affine transformations in feature space
(cepstral mean subtraction [3]), features derived from scale-invariant
transformations [4], and scale factor estimation with subsequent compensation
[5] [6], aim to extract features that are invariant to speaker-dependent variations.

Vocal tract length (VTL) variation is one of the major contributors to these
speaker variabilities [7]. VTL variation has been found to cause scaling in the
spectral domain [8]. Variations in formant (spectral peak) positions are closely
associated with variations in the lengths of the vocal tract components
[7] [9] [10]. Most normalization techniques assume that linear scaling of the
formants (and of spectra in general) is the main source of variation, and hence the
one that needs to be compensated. Compensating for this variability by re-scaling
the frequency axis provides substantial improvements in speech recognition
performance [6] [8] [11]. However, because the lengths of the vocal tract components
do not vary proportionally among speakers, scaling has been found to be highly
context dependent [12] and hence non-linear. These experiments suggest that a more
detailed modelling of the “scaling function” of the frequency axis may help in
understanding vowel (or, in general, phoneme) perception [13] [14] [15].

In the first half of this thesis we attempt to model these non-linearities in
scaling as a function of frequency alone, decoupled from context dependence. An
approximate non-linear scaling function for vowel normalization, motivated by the
work of Fant [12], is derived. This function is generalized to be applicable to
continuous vowel spectra. The generalized non-linear scaling function is applied to
a Hidden Markov Model (HMM) based recognizer, with the scale factor estimated in the
maximum likelihood (ML) sense. The second half of the thesis is motivated by
scale-invariant transformations [15]. A warping function is derived from the
proposed non-linear scaling function. This warping function is incorporated into a
non-linear scale-invariant transformation so that it can be used in an HMM-based
recognizer. We have used different analysis methods, namely formant data analysis,
spectral alignment and HMM-based recognizers, to compare the performance of the
proposed methods with other similar techniques.

The thesis is organized as follows. Chapter 2 provides the motivation for
non-linear scaling. The non-linearity is expressed purely as a function of
frequency, and a simple procedure for non-uniform vowel normalization on formant
frequencies is presented. The procedure is evaluated in terms of F-ratio, residual
variance and other separability measures. In Chapter 3 the proposed non-uniform
normalization scheme is generalized to continuous spectra and evaluated by
subjective spectral alignment of vowel data. A warping function is defined from this
non-linear scaling model so that it can be incorporated in a non-linear
scale-invariant transformation for an HMM-based recognizer. Chapter 4 presents a
method for incorporating the proposed non-linear scaling function in an HMM-based
recognizer. In Chapter 5 a non-linear scale-invariant transformation is derived from
the proposed warping function and used in an HMM-based recognizer. The performance
of these algorithms is evaluated, analysed and compared with that of other similar
techniques, using percentage correct and percentage accuracy as performance
measures. Finally, Chapter 6 draws conclusions from all the experimental results
about the effectiveness of the proposed non-linear warping function in a
recognizer/classifier framework.





Chapter 2 

Non-Uniform Vowel Normalization 


In speech recognition, the speaker dependence of the speech signal is mainly due to
variation in the vocal tract length of the speakers. An average adult male has a
vocal tract length (VTL) of around 17 cm, while an average adult female VTL measures
around 14.5 cm. With a uniform tube as a first-order approximation to the vocal
tract, the formant frequencies are inversely proportional to the tube length, so the
formants of an average female speaker are scaled up by about 17% (17/14.5 ≈ 1.17)
with respect to those of an average male speaker. Hence it is commonly assumed that
the formant patterns of male and female speakers are related by a pure scale factor
that is inversely proportional to the “overall” vocal tract length.

2.1 Simple Linear Scaling 

Nordstrom & Lindblom [8] have suggested a simple normalization procedure based
on an estimate of the speaker’s average vocal-tract length in open vowels, as
determined from measurements of the third formant F3. In their procedure of uniform
scaling, each formant frequency of the subject to be normalized is simply divided by

$$\alpha = \left(1 + \frac{k}{100}\right) = \frac{F_{3av}}{F_{3ref}} = \frac{l_{ref}}{l_{av}} \qquad (2.1)$$

where k is the scale factor in percent, l_av is the vocal-tract length associated
with the subject’s average F3 of open vowels (vowels with F1 greater than 600 Hz),
F3av is that average F3, F3ref is the corresponding value for the reference speaker,
and l_ref is the vocal-tract length of the reference “male” speaker.
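The uniform scaling step can be sketched in a few lines of Python. This is only an illustration under the reading of Eqn 2.1 in which the scale factor is the ratio of the subject's average open-vowel F3 to the reference F3; the function names and all formant values are hypothetical and are not taken from the thesis database.

```python
def uniform_scale_factor(f3_open_vowels_hz, f3_ref_hz):
    """Return (alpha, k) with alpha = F3av / F3ref and k = 100 * (alpha - 1)."""
    f3_av = sum(f3_open_vowels_hz) / len(f3_open_vowels_hz)
    alpha = f3_av / f3_ref_hz
    return alpha, 100.0 * (alpha - 1.0)


def normalize_uniform(formants_hz, alpha):
    """Divide every formant by the same factor alpha (uniform scaling)."""
    return [f / alpha for f in formants_hz]


# Hypothetical subject: F3 measured on three open vowels, in Hz.
subject_f3_open = [2850.0, 2810.0, 2780.0]
reference_f3 = 2440.0  # hypothetical reference (male) average F3, in Hz

alpha, k = uniform_scale_factor(subject_f3_open, reference_f3)
normalized = normalize_uniform([860.0, 2050.0, 2850.0], alpha)
print(f"alpha = {alpha:.3f}, k = {k:.1f}%")
print([round(f) for f in normalized])
```

Because a single factor is applied to every formant of every vowel, this scheme cannot represent the vowel- and formant-dependent deviations discussed next.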





Figure 2.1: Deviations from linear scaling.

Figure shows the actual values of average male (F1, F2) and average female (F1, F2)
points for various vowel categories [16]. Dashed line indicates predicted (F1, F2)
point, with a linear scaling α = 1.17. Variation in the distances, between predicted
and actual points, over vowel categories is evident. F1 & F2 are in Hz.


But in general, formant frequency locations for vowels are affected by three
factors: the overall length of the pharyngeal-oral tract, the location of
constrictions along the tract, and the narrowness of the constrictions [10]. The
simple scale rule neglects both the location of the constrictions and the vocal
tract shape. The actual nature of scaling among vowel categories can be seen in
Fig 2.1. It can be seen from the figure that the scaling is not uniform, but is a
function of both formant number and vowel category.






2.2 Non-Linear Scaling 


The observations suggest gross deviations from linear scaling. Why are these
deviations present? Different vowel articulatory constraints depend upon different
vocal cavity dimensions. It has been found that F2 is pharynx dependent while F3 is
mouth-cavity dependent. These “formant-cavity” affiliations, coupled with unequal
ratios of pharyngeal to oral cavity lengths between males and females, result in
these gross deviations. Fant [12] suggested a non-uniform vowel normalization
procedure in which a weighting factor k_nf, which is a function of both formant
number and vowel category, is multiplied by the ratio of the subject’s particular k
(Eqn 2.1) to the k = 17% of the reference female. With this non-uniform
normalization, Fant showed a substantial reduction in speaker differences between
males and females.

For a given vowel category, the formant-specific weighting factor is denoted
by k_nf in general. To identify the subject and reference speaker classes,
additional suffixes are added; for example, K_ncf denotes the formant-specific
weighting factor with children as the subjects and female as the reference speaker.

For female speakers,

$$K_{nfm} = \left(\frac{F_{n\,\mathrm{female}}}{F_{n\,\mathrm{male}}} - 1\right) \times 100 \qquad (2.2)$$

For child speakers,

$$K_{ncm} = \left(\frac{F_{n\,\mathrm{child}}}{F_{n\,\mathrm{male}}} - 1\right) \times 100 \qquad (2.3)$$

where “Fn female”, “Fn child” and “Fn male” are the average n-th formants of the
female, child and male clusters of speakers respectively, for a vowel category.

If the linear-scale rule were true, these factors would be constant, equal to
(α − 1) expressed in percent, over the formant numbers and the vowel categories.
But Fig 2.2 shows that there are systematic and substantial variations of these
factors with respect to vowel categories and formant numbers. To analyze the
child-female scaling variations, let us define a new set of formant-specific
weighting factors,

$$K_{ncf} = \left(\frac{F_{n\,\mathrm{child}}}{F_{n\,\mathrm{female}}} - 1\right) \times 100 \qquad (2.4)$$

with female as the reference speaker and child as the subject speaker. These factors
are plotted in Fig 2.3. It can be observed from the figure that these factors are
less formant and vowel specific than the child-male weighting factors.
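As a small worked illustration of these weighting factors, the Python sketch below computes the percentage factor for each formant of one vowel from two sets of cluster-average formants. The function name and the formant values are hypothetical round numbers chosen for readability; they are not the Peterson and Barney averages.

```python
def weighting_factors(subject_formants_hz, reference_formants_hz):
    """Percentage factor per formant: 100 * (F_subject / F_reference - 1)."""
    return [
        100.0 * (f_sub / f_ref - 1.0)
        for f_sub, f_ref in zip(subject_formants_hz, reference_formants_hz)
    ]


# Hypothetical cluster averages (F1, F2, F3) in Hz for one vowel category.
male_avg = [300.0, 2300.0, 3000.0]
female_avg = [345.0, 2760.0, 3300.0]

k_nfm = weighting_factors(female_avg, male_avg)
print([round(k, 1) for k in k_nfm])

# Under pure linear scaling all entries would be equal; the spread between
# the largest and smallest factor is a simple indicator of non-uniformity.
print(round(max(k_nfm) - min(k_nfm), 1))
```

With these illustrative numbers the three factors differ by several percentage points, which is the kind of formant-specific variation that a single uniform scale factor cannot capture.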








Figure 2.2: Weighting factors for females and children with reference male.

Figure shows the weighting factors in percentage for female and children groups with
male as the reference speaker. K_nfm plots Eqn 2.2, K_ncm plots Eqn 2.3. It can
be observed that K_3fm is less vowel specific and is hence used in scale factor, k,
estimation.


Fant’s Normalization Procedure:

1. Determine the subject’s F3av.

2. Calculate the ratio F3av/F3ref = (1 + k/100).

3. For each vowel to be normalized, select the corresponding female-male formant
specific scale factor set, k_nf = k_1f, k_2f, k_3f, ...

4. To normalize the n-th formant, weight the scale factor k_nf by the ratio of the
subject’s particular k to the k = 17% of the reference speaker:

$$k_n = k_{nf} \cdot \frac{k}{17} \qquad (2.5)$$











Figure 2.3: Weighting factors for children with reference male and female.

Figure shows the weighting factors in percentage for children with female and male 
as the reference speakers. Kncf plots (Eqn 2.4), Kncm plots (Eqn 2.3). Child-Female 
Kncf scaling variations are less vowel and formant specific, when compared to Child- 
Male Kncm scaling variations. Note: Kncm plot is also shown in Fig 2.2. 

where n is the formant number. This represents the best prediction of the subject’s
scale factors for the particular vowel. For child subjects, for whom k values
will be higher than 24%,

$$k_n = k_{nf} \cdot \frac{24}{17} + (k - 24)$$

The rationale behind this is the tendency of the child-female difference to be
less formant and vowel specific, as is evident from Fig 2.3.

Note: With male as the reference speaker, the scale factor k for an average female
speaker is around 17 (i.e., α = 1.17).








2.3 Proposed Non-Linear Scaling 

The main motivation for the work proposed here is to apply these methods to reduce
inter-speaker differences in speech recognition/classification applications. While
uniform scaling may be easily applied in such an application, non-uniform
normalization is preferable because of the improved separability. But Fant’s
non-uniform normalization is not practically implementable: firstly, it requires
knowledge of the vowel category and formant number, and secondly, it is not directly
applicable to continuous spectral patterns. In the proposed algorithm the idea is
to model the weighting factor, k_nf, as a function of frequency alone and to
implement the non-linear scaling on that basis. This algorithm should do away with
the need for a priori knowledge of the vowel category, while at the same time doing
better than conventional linear scaling.

2.3.1 Weighting Factor as a Function of Frequency 

To estimate the weighting factor as a function of frequency, the (F1, F2, F3)
triplets of 33 men and 28 women, covering all 10 vowels [16] spoken in the same
context, were used. Though Fant [12] used a mixture of vowel databases from various
languages such as Swedish, American English, Danish, Estonian, Dutch and Japanese,
in this work only the American English vowel-formant database is used for both
analysis and testing.

The weighting factor, k_nf, in Eqn 2.5 is a function of both formant number
and vowel category, i.e., a vowel and formant specific scale factor. These weighting
factors, as obtained for the American English database and for a mixture of six
languages (as used by Fant), are tabulated side by side for comparison in Tab 2.1.
This information is averaged over vowel category and formant number to obtain a
weighting function, γ, which is purely a function of frequency [17].

Formant scale factors in %

              k1           k2           k3
Vowels     (A)  (F)     (A)  (F)     (A)  (F)
/IY/        16    7      21   21      12   13
/IH/        12   11      24   22      19   18
/EH/        15   19      25   18      20   20
/AE/        29   27      18   17      17   18
/AH/        20   18      18   18      16   16
/AA/        20   25      12   15      13   15
/AO/         3   11       9    6      13   13
/UH/         7    3      13   12      19   18
/UW/        22    6       9    1      19   23
/ER/         2    2      20   21      15   16

Table 2.1: Formant and vowel specific scale factors, K_nfm (Eqn 2.2).

Here (A) denotes the American English vowel database [16] used for analysis. (F)
denotes the mixture of six languages which Fant used for analysis [12].

The k_nf values for two vowel categories, for all female speakers, are plotted
against frequency in Fig 2.4. Such discrete distributions of k_nf for all vowel
categories are accumulated and then averaged as a function of frequency. Here the
formant frequencies of all the speakers in a vowel category are mapped on to the
corresponding formant and vowel specific scale factor, k_nf. Hence we get an array
of frequency specific scale factor values, which is not dependent on formant number
and vowel category. Mathematically,

$$\gamma(f_n) = k_{nf}$$

where k_nf is the scale factor of the n-th formant of the vowel category and f_n
denotes the n-th formant of all speakers in the vowel category. Note that for a
given frequency f_0, the corresponding k_nf can have different values depending upon
the subject, vowel category and formant number. γ(f_0) is therefore computed as a
simple average of all the values of k_nf over a small band (100 Hz width) in the
vicinity of f_0, to get a smooth frequency specific scale factor function:

$$\gamma(b_i) = \frac{\sum_{f_n \in b_i} \gamma(f_n)}{N(\{f_n \in b_i\})} \qquad (2.6)$$

where N(.) is the cardinality of the set and n = 1, 2, 3. A plot of γ(f) as a
function of frequency is shown in Fig 2.5.
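The band averaging of Eqn 2.6 can be sketched as follows. This is a minimal illustration that assumes fixed, non-overlapping 100 Hz bands indexed from 0 Hz (the text only says "a small band in the vicinity" of the frequency, so a sliding window would be an equally valid reading). The function names and the sample pairs are hypothetical.

```python
from collections import defaultdict

BAND_WIDTH_HZ = 100.0  # band width stated in the text


def band_index(freq_hz, band_width_hz=BAND_WIDTH_HZ):
    """Index of the fixed band that contains freq_hz (band 0 starts at 0 Hz)."""
    return int(freq_hz // band_width_hz)


def gamma_by_band(samples, band_width_hz=BAND_WIDTH_HZ):
    """Average the scale factors that fall into each band.

    samples: iterable of (formant_hz, k_nf) pairs pooled over all speakers,
             vowel categories and formant numbers.
    returns: dict mapping band index -> mean k_nf in that band.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for formant_hz, k_nf in samples:
        b = band_index(formant_hz, band_width_hz)
        sums[b] += k_nf
        counts[b] += 1
    return {b: sums[b] / counts[b] for b in sums}


# Hypothetical pooled samples: (formant frequency in Hz, k_nf in percent).
pooled = [
    (310.0, 16.0), (350.0, 12.0), (390.0, 14.0),   # all fall in band 3
    (2710.0, 23.0), (2790.0, 21.0),                # both fall in band 27
]
gamma = gamma_by_band(pooled)
print(sorted(gamma.items()))
```

Bands that receive no formant samples are simply absent from the returned dictionary; how such gaps are filled (for example by interpolation) is a separate design choice that this sketch does not address.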






Figure 2.4: Weighting factor distribution for /IY/ and /UH/ vowel categories
(panels: /IY/, /UH/; horizontal axis: frequency).

Figure shows the weighting factors in percentage for female group with male as the
reference speaker. The discrete distributions of K_nf over formant frequencies (in Hz)
for various vowel categories are later combined to get a smoother function (Eqn 2.6).


Proposed Normalization Procedure:

To normalize the subject’s formant frequency F_n of any vowel, which lies
in the frequency band b_m (i.e. F_n ∈ b_m), divide it by a scale factor given by
the following. For adults:

$$k_{nonuniform} = \begin{cases} \gamma(b_m) \cdot \dfrac{k}{17} & \text{(for male reference speaker)} \\[2ex] \gamma(b_m) \cdot \dfrac{k}{-14.53} & \text{(for female reference speaker)} \end{cases} \qquad (2.7)$$

where F_n ∈ b_m and k is the subject’s uniform scale factor. For a child (k > 24):

$$k_{nonuniform} = \begin{cases} \gamma(b_m) \cdot \dfrac{24}{17} + (k - 24) & \text{(for male reference speaker)} \\[2ex] \ldots & \text{(for female reference speaker)} \end{cases} \qquad (2.8)$$
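A minimal Python sketch of the proposed procedure for a male reference speaker is given below. It rests on several assumptions that are not stated in this form in the text: the percentage scale factor is applied by dividing the formant by (1 + k_nonuniform/100), in analogy with Eqn 2.1; the weighting function is stored as a dictionary from 100 Hz band index to γ; and a band with no entry falls back to γ = 17, which reduces to uniform scaling. All names and numbers are hypothetical.

```python
MALE_REF_K = 17.0    # k of the average female relative to the male reference
CHILD_K_LIMIT = 24.0  # above this k the subject is treated as a child


def nonuniform_k(gamma_band, k):
    """Frequency-specific scale factor (percent) for a male reference speaker."""
    if k <= CHILD_K_LIMIT:
        return gamma_band * k / MALE_REF_K
    # Child subjects: the part of k above 24% is applied uniformly.
    return gamma_band * CHILD_K_LIMIT / MALE_REF_K + (k - CHILD_K_LIMIT)


def normalize_formant(formant_hz, k, gamma, band_width_hz=100.0,
                      default_gamma=MALE_REF_K):
    """Normalize one formant using the band-wise weighting function gamma.

    gamma: dict mapping band index -> weighting factor in percent.
    Assumption: bands missing from gamma fall back to default_gamma, for
    which nonuniform_k(...) equals k when k <= 24, i.e. uniform scaling.
    """
    band = int(formant_hz // band_width_hz)
    g = gamma.get(band, default_gamma)
    return formant_hz / (1.0 + nonuniform_k(g, k) / 100.0)


# Hypothetical weighting function with a single populated band.
gamma = {27: 22.0}
print(round(normalize_formant(2790.0, 17.0, gamma), 1))  # adult subject
print(round(normalize_formant(2790.0, 30.0, gamma), 1))  # child subject
```

Note that the two branches of `nonuniform_k` agree at k = 24, so the predicted scale factor is continuous across the adult/child boundary.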



Figure 2.5: Proposed non-linear scaling function. 

Figure shows the scaling function in percentage for female group with male as the 
reference speaker (Eqn 2.6). 

2.3.2 Experiments and Results 

The motivation for normalization is to improve the separability among vowel clusters
in (F1, F2, F3) space. Here the effectiveness of the proposed normalization scheme
(in reducing intra-cluster variance while retaining the separation between clusters)
is measured using the F-ratio, the residual variance after normalization [12] and
scatter plots. The Peterson and Barney [16] database is used for both analysis and
testing.

A Subject’s Scale Factor Estimation

The average scale factor k is determined differently for the linear [8]
and non-linear [12] scaling schemes. Nordstrom and Lindblom used only open
vowels (i.e., those vowels with F1 > 600 Hz: /AA/, /AE/ & /AH/). The subject’s
scale factor is given by

$$\alpha = \left(1 + \frac{k_{open}}{100}\right) = \frac{F_{3avg}}{F_{3ref}} \qquad (2.9)$$

where F3avg and F3ref are the average third formant measurements for open vowels
of the subject and of “all male speakers” (the reference speaker is male)
respectively. This scale factor has been used for all linear scaling experiments.
Apart from the scale factor from open vowels, k_open, Fant also used the k value
determined from (F2 F3)^{1/2} of the front vowel /IY/, with 0.5 weighting. As noted
earlier, F2 and F3 are known to be related to half-wavelength resonances in the
pharynx and the mouth cavity. The pharynx-cavity length difference between females
and males is more than the mouth-cavity length difference. This results in
k_2/IY/ > k_3/IY/, so that their mean, 0.5(k_2/IY/ + k_3/IY/), is a more balanced
quantity to include in the estimation of the actual k. Thus the scale factor used
for non-linear scaling experiments is given by

$$k = \frac{2\,k_{open} + \frac{1}{2}\left(k_{2\,/IY/} + k_{3\,/IY/}\right)}{3}$$


B Results
Normalization Plots

The distances between average (F1, F2) points after applying the various
normalization procedures to the single database are shown in Fig 2.6. The weighting
factors, k_nf, used for Fant’s non-uniform normalization scheme are taken from
Tab 2.1, while the proposed scheme uses the smooth weighting function, γ(f),
obtained from Eqn 2.6.

Figure 2.6: Effect of normalization on female cluster.

Figure shows the average (F1, F2) points of the normalized female cluster, with
normalization done by various methods. With normalization, the distance between male
and female clusters is reduced. F1 and F2 are in Hz.

F-ratio

Since discriminability between vowel clusters is as important as reduction
of variance within any given vowel cluster, a good measure of the usefulness of the
normalization schemes is the F-ratio [18].

In deriving the F-ratio separability, let M_i and R_i denote the mean formant
(F1, F2, F3) vector and its covariance matrix, respectively, of the i-th vowel
class. An equal probability of vowel classes is assumed. Let

$$M_0 = \frac{1}{I} \sum_{i=1}^{I} M_i$$

where I denotes the number of vowel classes being compared. Then the between-class,
S_b, and within-class, S_w, scatter matrices are computed by

$$S_b = \frac{1}{I} \sum_{i=1}^{I} (M_i - M_0)(M_i - M_0)^T, \qquad S_w = \frac{1}{I} \sum_{i=1}^{I} R_i$$

The separability criterion is then given by

$$J = \mathrm{trace}\{(S_b + S_w)^{-1} S_b\} \qquad (2.11)$$

The cluster discriminability (in terms of F-ratio, J, using the first three
formants) of the intrinsic, linearly scaled, Fant’s non-linearly scaled and proposed
non-linearly scaled vowel clusters from the Peterson and Barney database is
tabulated in Table 2.2. In Eqn 2.11, as the separability improves, J should approach
the ideal value of 3. Note that the F-ratio in the case of uniform scaling does not
change with the reference speaker, as the variance measure is unaffected by a change
in linear scaling.

                          Ref. Male                Ref. Female
F-Ratio                Ad. Only   Ad. & Ch.     Ad. Only   Ad. & Ch.
Unwarped                 2.21       2.01          2.21       2.01
Uniform Scaling          2.45       2.43          2.45       2.43
Simple Non-uniform       2.46       2.43          2.52       2.48
Fant’s Non-uniform       2.70       2.68          2.75       2.72

Table 2.2: Cluster discriminability in terms of F-Ratio.

Performance of the proposed non-linear scaling function as compared to other
methods, applied on Peterson and Barney data [16]. Here Ad. stands for adult
speakers and Ch. stands for child speakers.
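The separability criterion J = trace{(S_b + S_w)^{-1} S_b} can be computed with a short pure-Python sketch. The assumptions here are mine: class covariances use the population normalization (division by the number of points), classes are equally weighted, and the small linear solver is a generic Gauss-Jordan routine rather than anything prescribed by the thesis. The toy clusters are two-dimensional and hypothetical, so the ideal value in this example is 2 rather than 3.

```python
def mean_vector(rows):
    n = len(rows)
    return [sum(r[j] for r in rows) / n for j in range(len(rows[0]))]


def covariance(rows):
    """Population covariance matrix (divides by the number of points)."""
    m = mean_vector(rows)
    d, n = len(m), len(rows)
    return [[sum((r[a] - m[a]) * (r[b] - m[b]) for r in rows) / n
             for b in range(d)] for a in range(d)]


def solve(a, b):
    """Solve A X = B by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    aug = [list(a[i]) + list(b[i]) for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [vr - factor * vc for vr, vc in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def f_ratio(classes):
    """J = trace((Sb + Sw)^-1 Sb) for a list of classes (lists of vectors)."""
    means = [mean_vector(c) for c in classes]
    covs = [covariance(c) for c in classes]
    n_classes, d = len(classes), len(means[0])
    m0 = [sum(m[j] for m in means) / n_classes for j in range(d)]
    sb = [[sum((m[a] - m0[a]) * (m[b] - m0[b]) for m in means) / n_classes
           for b in range(d)] for a in range(d)]
    sw = [[sum(c[a][b] for c in covs) / n_classes for b in range(d)]
          for a in range(d)]
    total = [[sb[a][b] + sw[a][b] for b in range(d)] for a in range(d)]
    x = solve(total, sb)  # x = (Sb + Sw)^-1 Sb
    return sum(x[i][i] for i in range(d))


def square_cluster(cx, cy, s):
    """Four hypothetical points at the corners of a square around (cx, cy)."""
    return [[cx - s, cy - s], [cx - s, cy + s], [cx + s, cy - s], [cx + s, cy + s]]


centres = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
tight = [square_cluster(cx, cy, 1.0) for cx, cy in centres]
loose = [square_cluster(cx, cy, 3.0) for cx, cy in centres]
print(round(f_ratio(tight), 3), round(f_ratio(loose), 3))
```

Tighter clusters around the same centres give a larger J, which is the behaviour the normalization schemes in Table 2.2 are compared on.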

Residual Variance

For each subject, Eqn 2.7 and Eqn 2.8 give the k_n to be used with respect
to the reference speaker. This is our “prediction” of the formant specific scale
factor value k_n. The actual or observed scale factor k_obs of the subject may be
quite different. The efficacy of the normalization scheme is reflected by how close
the predicted k_n is to the observed k_obs. This is measured by the residual
variance remaining after normalization [12]. Since the cluster separability for the
proposed method is higher with the reference female speaker than with the reference
male speaker (see Tab 2.2), the residual variances for the first three formants,
V_n, between male-female clusters after normalization are tabulated only for this
case in Table 2.3.








                         V1        V2        V3
Uniform                19.76     14.22      4.77
Simple Non-uniform     17.37     11.61      4.67
Fant’s Non-uniform     15.93     11.22      4.51

Table 2.3: Residual variance after normalization.

Prediction error residue between male-female cluster after normalization, with
reference female speaker. V_n denotes the residual variance for the n-th formant
after normalization.


Scatter Plots

The F1-F2 scatter plots of the Peterson-Barney data under the various
normalization schemes are shown in Fig 2.7. Better separability with the proposed
normalization method can be seen.

2.4 Summary 

The reasons for the scaling of formant patterns of speech signals, and the nature
of this scaling, were noted. The non-linearity of the scaling function was modelled
and used in the proposed non-uniform normalization procedure. The proposed method,
which approximates the non-linearity as a function of frequency alone, was compared
both with linear scaling and with Fant’s non-linear scaling, which explicitly models
the non-linearity as a function of both vowel category and formant number. The
proposed method needs no more information than the uniform scaling method but
performs better in terms of both F-ratio and residual variance, while its
performance is slightly below that of Fant’s non-uniform scaling method, which
requires additional information.


Figure 2.7: Scatter plots of F1 - F2 for 10 Vowels (recoverable panel titles:
Proposed Non-Uniform Normalization, Fant Non-Uniform Normalization; horizontal
axes: F1 in Hz).

Figure shows the scatter plots of F1 - F2 for 10 vowels from Peterson-Barney data
with and without normalization. As seen in the figure the proposed normalization
scheme provides good separability among vowel clusters.








Chapter 3 


A Warping Function for 
Non-uniform Vowel Normalization 


It is commonly accepted that vocal tract length scaling results in scaling of the
resultant spectra as a whole. This scaling is assumed to show the same non-linear
behaviour as observed for formant patterns. The proposed normalization procedure,
in its present form, is applicable only to discrete formant patterns.
State-of-the-art speech recognizers, however, make use of continuous spectral
patterns; hence, for the proposed non-uniform normalization to be implemented on a
recognizer/classifier, the procedure must be extended to continuous spectral
patterns [19].

3.1 Non-linear Scaling of Continuous Vowel Spectra

As a first step in applying it to a recognizer, the proposed non-linear scaling was
tested on continuous spectra of vowel data. Spectral pattern alignment was tested
for speakers of both genders and of all ages. For every speaker in the database,
his/her vocal tract length (VTL), and hence the scale factor, is calculated. This
gross scale factor, α, is used for linear scaling of spectra. The non-linear scaling
operation uses both the scale factor, α, and the proposed scaling function, γ(f)
(Eqn 2.6).





3.1.1 Vocal Tract Length Estimation 


For the VTL estimation experiments, the steady vowel portions were extracted from
the Hillenbrand database for up to four chosen vowels spoken in “hVd” context.
Twenty speakers, five each from adult males, adult females, boys and girls, were
analysed. Each chosen vowel represented a vowel category: /IH/ for front vowels,
/AE/ for open vowels and /UH/ for rounded vowels. The vowel /AH/ was used for vocal
tract length estimation, as it can be modelled approximately by a uniform tube [7].
It can also be noticed from Fig 2.1 that the distance between the predicted and
actual (F1, F2) points is the least for the vowel /AH/, which supports this
assumption.

All data, sampled at 16 kHz, are sectioned into frames of 20 ms with an overlap of
10 ms. Pre-emphasis by a first-order backward difference and Hamming windowing are
applied. The stationary part of the vowel data is subjected to Weighted Overlap
Spectral Averaging (WOSA) [20] to compute the smooth spectrum (see Feature
Extraction, Sec 4.1).

One idea that has been suggested for computing the VTL is to use higher order
formants, since they are mostly affected only by the overall VTL [10]. Here the basic
assumption is that the higher formant frequencies do not deviate much from those of a
uniform tube having the same length. The length $l$ is estimated from the formant
frequency $F_i$ as,

$$ l = \frac{(2i - 1)\,c}{4 F_i} \qquad (3.1) $$


where $c$ is the velocity of sound in air. The accuracy of this approach has been
tested in [6]. Up to five formants are located manually for each of the twenty
speakers and are then used for vocal tract length estimation. The average of these
lengths is taken as the actual VTL of the speaker. A scatter plot of the estimated VTL
of all speakers in the chosen database is shown in Fig 3.1.
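
As an illustration of Eqn 3.1, the following Python sketch averages the uniform-tube
length estimates obtained from a speaker's formants. The function name
`vtl_from_formants` and the value c = 35000 cm/s are assumptions made for this
example; the thesis does not state the value of c that was used.

```python
import numpy as np

# Assumed speed of sound in cm/s (hypothetical choice for this sketch).
SPEED_OF_SOUND_CM_PER_S = 35000.0


def vtl_from_formants(formants_hz, c=SPEED_OF_SOUND_CM_PER_S):
    """Average the uniform-tube estimates l_i = (2i - 1) c / (4 F_i), in cm."""
    formants = np.asarray(formants_hz, dtype=float)
    i = np.arange(1, len(formants) + 1)           # formant index, starting at 1
    lengths = (2 * i - 1) * c / (4.0 * formants)  # one length estimate per formant
    return float(lengths.mean())


# Formants of an ideal 17.5 cm uniform tube (500, 1500, 2500 Hz for c = 35000 cm/s).
print(vtl_from_formants([500.0, 1500.0, 2500.0]))
```

For real speech the individual estimates differ from formant to formant, which is why
their average is taken.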


Figure 3.1: Scatter plot of estimated VTL (in cm) for all twenty speakers used in the
analysis (boys, males, females and girls). On average, VTL for men = 17.3 cm, VTL for
women = 14.6 cm, VTL for boys = 13.4 cm and VTL for girls = 13.8 cm.


3.1.2 Experiments and Results

The implementation details of linear and non-linear scaling of vowel spectra are
presented in this section. The smooth spectrum of the frame considered is obtained by
calculating the Fourier transform of its autocorrelation function at equi-spaced
frequencies given by the sampling grid $F$. Essentially, $F$ is a 500 point array of
equi-spaced frequency values ranging from 100 Hz to 7000 Hz. The scale factor $\alpha$
for a given speaker is computed by taking the ratio of the subject's estimated vocal
tract length to that of the average male (reference) vocal tract length (see Eqn 2.1).
Linear scaling is now effected by computing the spectrum at the frequencies given by
the scaled sampling grid, $\alpha F$. In the case of non-linear scaling, $\alpha$ is no
longer a fixed number but an array, whose values are computed at the frequencies given
by $F$. Non-linear scaling is effected by computing the spectrum values at frequencies
given by

$$ F' = \alpha \mathbin{.\!*} F \qquad (3.2) $$

where $\alpha = 1 + (\gamma(F)/130)$

(.* denotes point-by-point array multiplication)
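
The resampling idea described above can be sketched in Python as follows. This is only
an illustration: `spectrum_fn` stands in for "Fourier transform of the autocorrelation
evaluated at arbitrary frequencies", the toy spectrum is a single Gaussian peak, and
the frequency-dependent array `alpha_f` is an arbitrary example rather than the
thesis's scaling function.

```python
import numpy as np


def scaled_spectrum(spectrum_fn, grid_hz, alpha):
    """Evaluate the spectrum on the grid alpha .* F.

    A scalar alpha gives linear scaling; an array (one value per grid point)
    gives non-linear scaling.
    """
    alpha = np.broadcast_to(np.asarray(alpha, dtype=float), grid_hz.shape)
    return spectrum_fn(alpha * grid_hz)


def toy_spectrum(f_hz):
    """Hypothetical smooth spectrum with one 'formant' at 1000 Hz."""
    return np.exp(-0.5 * ((f_hz - 1000.0) / 150.0) ** 2)


F = np.linspace(100.0, 7000.0, 500)          # sampling grid as in the text
linear = scaled_spectrum(toy_spectrum, F, 1.17)
alpha_f = np.linspace(1.09, 1.17, F.size)    # illustrative frequency-dependent factor
nonlinear = scaled_spectrum(toy_spectrum, F, alpha_f)
```

With linear scaling by 1.17 the peak of the toy spectrum moves from 1000 Hz to about
855 Hz on the grid; with the illustrative array it moves less, because the factor is
smaller at low frequencies.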






A Results 


For the no scaling, linear scaling and proposed non-linear scaling cases, the spectral
patterns for vowels in different vowel categories are shown in Fig 3.2, Fig 3.3 and
Fig 3.4. It can be noticed that in all cases non-linear scaling aligns the spectral
patterns from different speakers much more effectively than linear scaling.


Figure 3.2: Various scaling functions as applied to the front vowel /IH/.
The figure shows the spectral patterns for male #5, female #5, boy #2 and girl #1 with
(a) no scaling, (b) linear scaling by their respective $\alpha$'s, (c) proposed
non-linear scaling. The abscissa represents equi-spaced samples.


Gross linear scaling accounts for most of the VTL-induced variability. The proposed
non-linear scaling function, Eqn 3.2, performs a finer alignment on top of this. This
alignment is at its best in the first formant region, in all the vowel categories. On
the whole, it performs better for front vowels than for other vowel categories. This
is also supported by its better performance, when applied to the /IY/ vowel, for the
discrete formant patterns (Fig 2.6).







Figure 3.3: Various scaling functions as applied to the open vowel /AE/.
Conditions same as in Fig 3.2.


3.2 Approximate Scaling Function 

For the above experiments the scaling function used was the one from Eqn 2.6 [17].
A piecewise approximation to this function can be made, as shown in Fig 3.5. Since
there are very few data points at some of the low frequencies, only those frequency
bands that have a large number of data points are used in deriving the piecewise
approximation. The spectral alignment plots (for the vowel /IH/, for the same set
of speakers) using this approximate scaling function for the non-linear scaling are
shown in Fig 3.6. Once again it can be noticed that non-linear scaling with this
approximate scaling function still performs better than linear scaling.

Figure 3.4: Various scaling functions as applied to the rounded vowel /UH/.
Conditions same as in Fig 3.2.


3.3 Warping Function

Given the superiority of non-linear scaling, it would be of interest to model this
non-linearity [20] [12] [21] [22]. Approximating the scaling function by $\gamma(f)$,
a function of frequency alone, is a step in this direction. With the proposed
non-linear scaling function (Eqn 2.6), normalization can be done without context
dependence, but we still need to estimate the exact scale factor for the speaker.
Here we need to address questions like: (1) Is there any method of utilizing this
knowledge without estimating $\alpha$? (2) Supposing we have complete knowledge of the
exact scaling function that exists, can we come up with a universal warping function
which leads to scale invariance [20]? Such a warping function would be of great value
in deriving speaker independent robust features. Motivated by such questions, a
warping function for the proposed non-linear scaling function is derived. In the
analysis, consider a function $g$ which transforms the frequency axis as
$f' = g(\alpha, f)$, where $\alpha$ is the speaker's scale factor. The non-linearity in
scaling is modelled, similar to earlier works [20] [21], as,

$$ f' = g(\alpha, f) = \alpha(f)\, f = \alpha^{\beta(f)} f \qquad (3.3) $$


Figure 3.5: Approximate scaling function.
Piecewise approximation of the weighting factor as a function of frequency alone
(abscissa: frequency in Hz).

where $\alpha$ is the subject's scale factor with respect to a reference speaker and is
independent of frequency, while $\beta(f)$ depends only on the frequency and is
independent of the speaker. The non-linearity of the scale factor is hence captured by
$\beta(f)$. Given the model for non-linearity, Eqn 3.3 can be modified to obtain scale
invariance as,


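$$ \log(f') = \beta(f)\log(\alpha) + \log(f) \qquad (3.4) $$

$$ \frac{\log(f')}{\beta(f)} = \log(\alpha) + \frac{\log(f)}{\beta(f)} \qquad (3.5) $$

$$ \nu' = \nu + \text{constant shift} \qquad (3.6) $$

$$ \text{where } \nu = W(f) = \frac{\log(f)}{\beta(f)} \qquad (3.7) $$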


Figure 3.6: Approximate scaling function as applied to the front vowel /IH/. The
non-linear scaling uses the approximate scaling function given in Fig 3.5.

In arriving at this warping function $W(f)$, it has been assumed that
$\beta(f') \simeq \beta(f)$. Hence $W(f)$ is a warping function such that, in the
warped domain $\nu$, the spectral patterns of different speakers are approximately
translated versions of each other. The magnitude of the Fourier transform of these
warped spectral patterns is invariant to translations, leading to scale invariant
features of real speech signals. For the given model, we can derive the warping
function as,


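$$ \alpha^{\beta(f)} = 1 + \frac{\gamma(f)\,(\alpha - 1)}{17} \qquad (3.8) $$

hence,

$$ \beta(f) = \frac{\log\left(1 + \frac{\gamma(f)\,(\alpha - 1)}{17}\right)}{\log(\alpha)} \qquad (3.9) $$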


Eqn 3.9 is valid for all values of $\alpha$ and is invariant to the choice of the
reference speaker. The question now is what value of $\alpha$ should be used. If an
average male speaker is considered as the reference speaker, then for an average
female speaker its value is around 1.17. Substituting this value in Eqn 3.9 we have,

$$ \beta(f) = \frac{\log\left(1 + \frac{\gamma(f)}{100}\right)}{\log(1.17)} \qquad (3.10) $$

$$ W(f) = \frac{\log(f)}{\beta(f)} \qquad (3.11) $$
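
A small numerical sketch of the warping function may help here. It assumes that
Eqn 3.10 has the form $\beta(f) = \log(1 + \gamma(f)/100) / \log(1.17)$ and uses a few
$\gamma(f)/17$ sample values from Tab 4.1; the function names are hypothetical and
chosen only for this example.

```python
import numpy as np

# A few (frequency, gamma(f)/17) samples taken from Tab 4.1.
FREQ_HZ = np.array([187.5, 1000.0, 2156.3, 3437.5])
GAMMA_OVER_17 = np.array([0.5294, 0.828, 1.47, 0.941])


def beta_from_gamma(gamma_percent, alpha_ref=1.17):
    """Assumed form of Eqn 3.10: log(1 + gamma/100) / log(alpha_ref)."""
    return np.log1p(np.asarray(gamma_percent, dtype=float) / 100.0) / np.log(alpha_ref)


def warp(freq_hz, beta):
    """Warping function of Eqn 3.11: W(f) = log(f) / beta(f)."""
    return np.log(freq_hz) / beta


beta = beta_from_gamma(17.0 * GAMMA_OVER_17)
nu = warp(FREQ_HZ, beta)
print(np.round(beta, 3))
```

Under this assumption the value of $\beta$ at the lowest frequency comes out close to
0.55, in line with the value of about 0.5 read from Fig 3.7.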


$W(f)$ is the desired warping function. Let us try to gain some insight into the
physiological relevance of the proposed $\beta(f)$ function. We know from the studies
of Fant [7] that the Helmholtz resonator (two tube) is a good approximation of the
vocal tract for the first formant (low frequency) region of front vowels like /IY/.
It has been noted that, due to a change in the vocal tract size by a factor of
$\alpha$, the first formant for such a vowel (with low $F_1$) gets scaled by a factor
of $\sqrt{\alpha}$ instead of $\alpha$. An interesting observation can be made from
Fig 3.7, in which the derived $\beta(f)$ value is around 0.5 in the low frequency
region, implying that the effective scale factor would be $\sqrt{\alpha}$, as it
ideally should be. The scale factor $\alpha$ for a speaker is calculated from
measurements made in the high frequency (higher formant) region; hence it is intuitive
that $\beta(f)$ should approach unity at high frequencies (beyond 3500 Hz).

Figure 3.7: $\beta(f)$ for non-linear scaling schemes.
$\beta(f)$ as given by Eqn 3.10 is plotted along with other scaling functions (see
Sec 3.3); $f$ is in Hz.

Figure 3.8: Warping functions for various non-linear scaling schemes (Eide et al,
proposed warping, uniform scaling, Umesh et al).
$W(f)$ as given by Eqn 3.11 is plotted along with various other scaling functions
(see Eqn 3.12); $f$ is in Hz.

In order to study its relation with the existing approaches, which use similar models
for non-linear scaling, the warping functions are derived from their proposed
$\beta(f)$ functions, which are shown in Fig 3.7. For these various scaling schemes,
the equivalent warping functions are given below. A subjective comparison of these
functions is shown in Fig 3.8.

Uniform scaling: $W(f) = \log(f)$

Eide & Gish: $W(f) = \log(f)/\beta(f)$, with $\beta(f)$ as proposed in their scheme

Umesh et al: $W(f) = \log(f)/\beta_i$ $\qquad (3.12)$

where $\beta_i$ is a piecewise function of frequency as given in Tab 3.1.


Frequency band (Hz)   [100,240)   [240,550)   [550,1280)   [1280,3000)   [3000,7000)
$\beta_i$             6.0         4.3869      2.4629       1.4616        1

Table 3.1: $\beta$ values for five logarithmically equi-spaced frequency regions,
as used by Umesh et al [20].

3.4 Summary 

A non-uniform vowel normalization procedure, which accomplishes better alignment of
spectral patterns for various vowels between adult and child speakers, was presented.
It was found to be more effective for front vowels than for other vowel categories.
A warping function aimed at removing inter-speaker differences due to VTL variations
was also derived from the proposed non-linear scaling procedure. Its physiological
relevance and its relationship with other existing warping functions were presented.







Chapter 4 

Recognizers With Scale Factor 
Estimation 


Non-uniform normalization can be implemented on an HMM-based recognizer in many ways.
All these approaches attempt to "normalize" the parametric representation of the
speech signal, with the intention of reducing inter-speaker differences caused by
vocal tract length variations. There are two broad approaches to feature based speaker
normalization: (1) The first approach is to directly estimate the "gross scale factor
$\alpha$", either by the maximum likelihood (ML) method [5] or by formant estimation
(physiological motivation) from the speech data [22]. (2) The second type of system
uses a suitable scale invariant transformation, so that there is no need for explicit
$\alpha$ estimation. Both of these systems can implement either linear scaling or
non-linear scaling. Hence speaker normalization techniques mainly vary in aspects like
(1) the need for estimating the "gross scale factor $\alpha$", (2) the method of
estimating $\alpha$ (if needed), (3) the model for the scaling function
(linear/non-linear). In this thesis both types of recognizers, i.e. the one which
estimates $\alpha$ explicitly in the ML sense and the one which uses a scale-invariant
transformation, have been implemented and analysed, incorporating the proposed
non-linear scaling function. In this chapter, recognizers which estimate the scale
factor in the ML sense are used to implement the proposed non-linear scaling function
$\gamma(f)$ (Eqn 2.6).


Figure 4.1: Block diagram of a continuous speech recognizer.
The block diagram depicts the various stages (front end and back end) of a continuous
automatic speech recognizer.


4.1 Hidden Markov Model Based Speech Recognizer

An automatic speech recognizer is a system which allows a computer to recognize the
words spoken by a person. The automatic speech recognition (ASR) problem is to find
the sequence of words corresponding to a given set of acoustic features. A block
diagram of the various stages involved in ASR is shown in Fig 4.1. It involves two
stages: (1) feature extraction and (2) pattern recognition.

4.1.1 Feature Extraction 

The feature extraction stage is necessary to reduce the dimensionality of the problem
and to get a parsimonious representation of the speech signal, where only phonetically
relevant information is retained and unwanted distortions are eliminated. The scheme
for extraction of Mel Frequency Cepstral Coefficients (MFCC) and Weighted Overlap
Spectral Average-Mel (WOSA-Mel) features is depicted in Fig 4.2.

Preemphasis: The characteristics of the vocal tract determine the sound produced for
a phoneme. Such characteristics are evidenced in the frequency domain by the location
of peaks (formants). A roll-off of 20 dB/decade is observed in the spectrum of a
speech signal. A preemphasis of high frequencies is therefore required to obtain
similar amplitudes for all formants. A first order FIR filter with the transfer
function $H(z) = 1 - 0.975 z^{-1}$ is used for the purpose.









Windowing: Traditional methods for spectral estimation are valid only for stationary
signals. For speech this is true only over short intervals of time. Hence a short time
analysis is performed by windowing the speech signal with a suitable window, such as
the Hamming window.

Spectral analysis: The spectrum of a speech signal contains information about the
spoken sound and the speaker. The spectral envelope reveals those features of the
speech signal which are mainly due to the shape of the vocal tract. Here we discuss
two methods of smoothing employed to remove the effects of pitch.

1. Filter Bank: In this method Fourier analysis is done on each frame of the speech
signal. A bank of filters spaced uniformly on the mel scale (motivated by human
perceptual studies) [23] averages the spectral energies over bands. These filter
outputs are taken as samples of the smooth spectral envelope.

2. WOSA: This method is similar to the averaged periodogram technique [24]. In this
method each frame of speech is segmented into L overlapping subframes, and each
subframe is Hamming windowed. In this work subframes of 64 samples each, with an
overlap of 45 samples, were used. An estimate of the autocorrelation for each
subframe is obtained and averaged over the L subframes. This averaged
autocorrelation estimate is used to compute the smooth spectrum by Fourier analysis.
This method effectively removes the effect of pitch, since the duration of each
subframe is less than the pitch interval.
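
The WOSA steps listed above can be sketched in Python as follows. This is a minimal
illustration under stated assumptions, not the thesis implementation (which was
written in C++): the biased autocorrelation estimate, the function names and the
cosine-sum evaluation of the spectrum at arbitrary frequencies are choices made for
this example.

```python
import numpy as np


def wosa_autocorr(frame, sub_len=64, overlap=45):
    """Average the autocorrelation of Hamming-windowed, overlapping subframes."""
    hop = sub_len - overlap
    window = np.hamming(sub_len)
    acfs = []
    for start in range(0, len(frame) - sub_len + 1, hop):
        x = frame[start:start + sub_len] * window
        full = np.correlate(x, x, mode="full")      # lags -(L-1) .. (L-1)
        acfs.append(full[sub_len - 1:] / sub_len)   # keep lags 0 .. L-1 (biased estimate)
    return np.mean(acfs, axis=0)


def spectrum_at(acf, freqs_hz, fs):
    """Fourier transform of the (even) autocorrelation at arbitrary frequencies."""
    omega = 2.0 * np.pi * np.asarray(freqs_hz, dtype=float) / fs
    lags = np.arange(1, len(acf))
    return acf[0] + 2.0 * np.cos(np.outer(omega, lags)) @ acf[1:]


# Example: a 20 ms frame of a 1 kHz tone sampled at 16 kHz.
fs = 16000
frame = np.sin(2.0 * np.pi * 1000.0 * np.arange(320) / fs)
grid = np.linspace(100.0, 7000.0, 500)
smooth = spectrum_at(wosa_autocorr(frame), grid, fs)
```

Because the spectrum is evaluated from the autocorrelation at any desired frequencies,
the same routine can be called on a scaled or warped grid.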

Cepstral features: The logarithm of the magnitude of these samples of the smooth
spectral envelope is subjected to the inverse Discrete Cosine Transform (DCT). The DCT
has the property of producing highly decorrelated features [25]. The zeroth order MFCC
coefficient is approximately equivalent to the log energy of the frame. It is
discarded and the normalized energy of the frame is used in its place. Cepstral mean
subtraction is found to compensate for channel variations [26].

Temporal Cepstral derivatives: The cepstral representation of the speech spectrum
provides a good representation of the local spectral properties of the signal for the
given analysis frame. An improved representation can be obtained by extending the
analysis to include information about the temporal cepstral derivative [27]. Simple
first and second order differences are used as approximate (and noisy) estimates of
the cepstral derivatives.
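
A simple difference-based derivative can be sketched as below. The backward
difference and the zero first row are assumptions of this example; the thesis does
not specify which difference scheme was used.

```python
import numpy as np


def add_deltas(cepstra):
    """Append first and second order differences along the time axis.

    cepstra: array of shape (num_frames, num_coeffs).
    """
    delta = np.diff(cepstra, axis=0, prepend=cepstra[:1])   # first row is zero
    delta2 = np.diff(delta, axis=0, prepend=delta[:1])
    return np.hstack([cepstra, delta, delta2])


# 13 static features per frame (energy plus 12 cepstra) give 39 dimensions.
static = np.arange(5, dtype=float)[:, None] * np.ones((1, 13))
features = add_deltas(static)
```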


Figure 4.2: Feature extraction stages.
The block diagram depicts the various stages involved in the spectral analysis
methods. Note: ACF denotes autocorrelation function.












4.1.2 Pattern Recognition 

The pattern recognition problem for ASR can be solved in any of three paradigms:
(1) vector quantization, (2) hidden Markov modelling, (3) artificial neural networks.
Speech recognition is associated with a lot of uncertainty due to different sources of
variability (speaker, channel noise, etc.). Stochastic modelling is a flexible method
of accounting for such variability. One of the major advantages of using HMMs in
speech recognition is their ability to provide a uniform framework for the stochastic
representation of both acoustic and lexical rules, along with other sources of
knowledge. A very brief introduction to the usage of HMMs in ASR is given below.
More fundamental details on their usage in ASR can be found in [28] [29].

HMMs constitute a "doubly stochastic process" in which the observed data is modelled
as having been generated by a piecewise stationary process. HMMs can be used to model
a specific unit of speech such as a sentence, a word, a subword or a phone. HMMs are
characterized by the number of states, the transition probabilities among them and
the output probability distribution associated with each state. These probability
density functions can be either discrete or continuous. Discrete density functions
are modelled by a set of discrete point probabilities for each of the vectors in the
codebook obtained by vector quantizing the observations. To avoid quantization
errors, observations are often modelled directly with a mixture of Gaussians,
resulting in a continuous density function.

The training procedure involves optimizing the HMM parameters given an ensemble of
training data. The HMM parameters are initialized suitably using the segmental k-means
algorithm. The Baum-Welch method is used to iteratively estimate the transition
probabilities and the parameters of the state probability density functions [30].

With the embedding of both acoustic and linguistic knowledge we get a huge extended
HMM network of states. The goal of the decoding process in HMMs is to determine the
maximum likelihood sequence of states which has generated the observed signal. An
exhaustive search through this network of states is infeasible for any realistic
recognition problem. Hence Viterbi-like sub-optimal algorithms are adopted for
efficient searches. The basic idea in such algorithms is simple: at any discrete time
t, the probabilities of the hypotheses of "being in any admissible state" are
computed. At the end, only the more likely sequences are selected, hence avoiding an
exhaustive search.

With the availability of such efficient training and search algorithms, HMMs provide
a unified platform for statistical modelling of the ASR problem.
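
For illustration, the dynamic programming idea behind such searches can be written as
a plain Viterbi recursion in the log domain. This sketch keeps every state at every
time step and does no pruning, so it is the textbook algorithm rather than the
sub-optimal beam variant a large-vocabulary recognizer would use; the function and
variable names are chosen for this example only.

```python
import numpy as np


def viterbi(log_init, log_trans, log_obs):
    """Most likely state sequence.

    log_init[s]     : log initial probability of state s
    log_trans[i, j] : log transition probability i -> j
    log_obs[t, s]   : log likelihood of observation t in state s
    """
    num_frames, num_states = log_obs.shape
    score = log_init + log_obs[0]
    back = np.zeros((num_frames, num_states), dtype=int)
    for t in range(1, num_frames):
        cand = score[:, None] + log_trans        # cand[i, j]: best path in i, then i -> j
        back[t] = np.argmax(cand, axis=0)        # best predecessor of each state j
        score = cand[back[t], np.arange(num_states)] + log_obs[t]
    path = [int(np.argmax(score))]
    for t in range(num_frames - 1, 0, -1):       # trace the best path backwards
        path.append(int(back[t, path[-1]]))
    return path[::-1], float(np.max(score))


# Two-state left-to-right toy model with three observations.
NEG_INF = -np.inf
log_init = np.array([0.0, NEG_INF])
log_trans = np.array([[np.log(0.5), np.log(0.5)], [NEG_INF, 0.0]])
log_obs = np.log(np.array([[0.9, 0.1], [0.8, 0.2], [0.1, 0.9]]))
best_path, best_log_prob = viterbi(log_init, log_trans, log_obs)
```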

4.2 Non-uniform Normalization on the Recognizer 

As noted earlier, vocal tract size variations, which result in scaling of the
frequency spectrum of speech signals, account for a major portion of the inter-speaker
variations. Hence it is intuitive to normalize the frequency spectrum of each speaker,
with proper estimation of the scaling factor. A speaker's scale factor can be
estimated by linking the articulatory variations to spectral parameters. Such an
estimated scale factor can then be plugged in to scale the frequency spectrum of the
speech signal either linearly or non-linearly [22] [6]. A second class of systems
differs in the way the scale factor is estimated for the speaker. In this class of
recognizers, for every speaker in the training set an optimal scaling factor (in the
ML sense), $\hat{\alpha}$, is estimated, which is then used for warping the utterance.
All of the warped utterances are used to build a "normalized" HMM. Similarly, during
recognition, $\hat{\alpha}$ is estimated for every input utterance and is then used to
scale it. The warped utterance is subjected to decoding on the normalized HMM.

4.2.1 Scaling Factor Estimation in ML Sense 

The scaling factor, $\alpha$, represents the ratio between a speaker's vocal tract
length and some notion of a reference vocal tract length. However, it is very
difficult to reliably estimate such a factor from the acoustic data, especially in the
absence of that "golden reference speaker". In the maximum likelihood method of
estimating the scaling factor, the role of the reference speaker is served by a
reference HMM. The goal is to choose an $\alpha$ for an utterance such that its
likelihood for the given model is maximized.

$$ \hat{\alpha} = \arg\max_{\alpha} \Pr(X_i^{\alpha} \mid \lambda, W_i) \qquad (4.1) $$

where $X_i^{\alpha}$: the warped feature set of the utterance,

$\lambda$: the reference HMM model,

$W_i$: the transcription of the utterance.

The optimum scaling factor is obtained by searching over a grid of 13 factors spaced
evenly over $0.88 \le \alpha \le 1.12$ for adults and $0.76 \le \alpha \le 1.0$ for
children, reflecting the range of about 25% variation of VTLs among males and females
and about 36% between children and adults.
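
The grid search of Eqn 4.1 can be sketched as follows. The callable `log_likelihood`
is a hypothetical stand-in for "warp the features with $\alpha$, align them against
the reference HMM with the known transcription and return the log-likelihood"; it is
not part of any real toolkit. The toy score below only serves to show the search.

```python
import numpy as np

ADULT_GRID = np.linspace(0.88, 1.12, 13)   # 13 evenly spaced factors, step 0.02
CHILD_GRID = np.linspace(0.76, 1.00, 13)


def estimate_alpha(log_likelihood, alphas):
    """Return the grid value that maximizes the supplied log-likelihood."""
    scores = [log_likelihood(a) for a in alphas]
    return float(alphas[int(np.argmax(scores))])


def toy_log_likelihood(alpha):
    """Hypothetical score that peaks at alpha = 0.94."""
    return -(alpha - 0.94) ** 2


alpha_hat = estimate_alpha(toy_log_likelihood, ADULT_GRID)
```

Since each grid point requires one feature extraction and one alignment, the cost of
the estimate grows linearly with the number of factors searched.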

4.2.2 Training and Testing Procedure 

It is clear from the algorithm that the scaling factor estimation process requires a
pre-existing HMM. Therefore, an iterative procedure is used to choose the best scaling
factor for each speaker and then build a model using the warped training utterances.

Training Procedure:

1. Divide the whole training database into two halves.

2. Build an "unnormalized" model ($\lambda_T$) using one half of the database.

3. For every utterance in the second half, choose an $\alpha$ such that
$\Pr(X_i^{\alpha} \mid \lambda_T, W_i)$ is maximized, where $X_i^{\alpha}$ is the
feature set warped using the linear/non-linear warping function.

4. Swap the two sets and repeat the above procedure iteratively, until there is no
significant change in the $\alpha$ values.

5. Build a "normalized" model ($\lambda_N$) using all the warped utterances in the
training set.

Here it is assumed that a phonetic transcription of the training set is available. If
this is not so, a first pass of decoding is required to obtain the time alignment of
the phonemes in the utterances, which is then used to stack the models and hence
estimate $\alpha$. This first pass of decoding is a one-time process for every
utterance in the training set. During recognition, the goal is to scale the frequency
axis of each test utterance to "match" that of the normalized HMM $\lambda_N$.





Testing Procedure:

1. The unwarped utterance $X_i$ and the normalized model $\lambda_N$ are used to
obtain a preliminary transcription of the utterance. Let the transcription obtained
from the unwarped feature set of the utterance be denoted as $W_i$.

2. $\hat{\alpha} = \arg\max_{\alpha} \Pr(X_i^{\alpha} \mid \lambda_N, W_i)$ is
computed with the linear/non-linear warping function.

3. The warped utterance $X_i^{\hat{\alpha}}$ is decoded with the model $\lambda_N$ to
obtain the final recognition result.

A block diagram explaining the various stages involved in recognition with speaker
normalization is shown in Fig 4.3.
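
The three testing steps can be summarized in a short Python sketch. Both callables are
hypothetical placeholders: `decode(utterance, alpha)` is assumed to return a
transcription for features warped by `alpha`, and
`align_log_likelihood(utterance, alpha, transcript)` is assumed to return the
log-likelihood of the warped features given that transcription.

```python
def recognize_two_pass(utterance, decode, align_log_likelihood, alphas):
    """Two-pass recognition: decode unwarped, estimate alpha, decode warped."""
    first_pass = decode(utterance, 1.0)            # step 1: alpha = 1 means no warping
    best_alpha = max(                              # step 2: ML estimate over the grid
        alphas,
        key=lambda a: align_log_likelihood(utterance, a, first_pass),
    )
    return decode(utterance, best_alpha), best_alpha   # step 3: final decoding


# Toy stand-ins, only to show how the pieces fit together.
def toy_decode(utterance, alpha):
    return f"{utterance}@{alpha:.2f}"


def toy_align_log_likelihood(utterance, alpha, transcript):
    return -(alpha - 0.90) ** 2


result, chosen = recognize_two_pass("utt", toy_decode, toy_align_log_likelihood,
                                    [0.88, 0.90, 0.92])
```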


4.2.3 Non-Linear Scaling With Filterbank Analysis 

In the previous section the process of scale factor estimation and its usage in HMM
training and testing was explained. In this section an efficient method to implement
both linear and non-linear scaling functions in a recognition system whose features
are extracted from a filterbank is explained. The standard Davis-Mermelstein [23]
filterbank front end is used to derive Mel Frequency Cepstral Coefficient (MFCC)
features for the HMM-based recognizer. It works by first calculating the magnitude
spectrum of the windowed speech, then passing it through a mel-scale filterbank and
finally taking the inverse cosine transform to arrive at the cepstrum. Though linear
scaling can be implemented as a simple resampling of the speech signal in the time
domain, it is more efficient to push the process onto the filterbank front end itself.
Moreover, time domain resampling is difficult to conceptualize for a non-linear
scaling implementation. Frequency scaling can be implemented by simply varying the
spacing and width of the component filters of the filterbank, without changing the
original speech signal [11]. This process is depicted in Fig 4.4. For example, to
compress the speech signal in the frequency domain, the frequency scale of the signal
is kept the same but the frequency scale of the filters is stretched. Warping is
implemented by dividing the filter boundaries by a suitable $\alpha$ given by,

$$ F_l'(i) = F_l(i)/\alpha(F_l(i)) $$

$$ F_h'(i) = F_h(i)/\alpha(F_h(i)) \qquad (4.2) $$

For linear warping, $\alpha(f) = \alpha$ for all $f \in [270, 3850]$.

For non-linear warping, $\alpha(f_0) = \frac{\gamma(f_0)}{17}(\alpha - 1) + 1$,

where $[F_l(i), F_h(i)]$ are the low and high end boundaries of the $i$-th component
filter in the filterbank respectively, $\alpha$ denotes the estimated gross scale
factor of the speaker and $\gamma(f_0)$ is the approximate weighting function value
(Fig 3.5) in the vicinity of $f_0$. Tab 4.1 gives the Mel-spaced triangular filter
boundaries with no warping, and the corresponding $\gamma(f)/17$ values.

Figure 4.3: Recognition with speaker normalization.
The block diagram depicts the various stages involved in recognition with speaker
normalization [11].






Figure 4.4: Mel filterbank analysis with frequency warping.


index   f in Hz   $\gamma(f)/17$
1       187.5     0.5294
2       312.5     0.5294
3       406.25    0.5294
4       500       0.5294
5       593.75    0.5294
6       687.5     0.588
7       812.5     0.668
8       906.25    0.748
9       1000      0.828
10      1093.8    0.909
11      1218.8    0.989
12      1312.5    1.069
13      1468.8    1.149
14      1593.8    1.23
15      1781.3    1.31
16      1937.5    1.39
17      2156.3    1.47
18      2343.8    1.41
19      2593.8    1.29
20      2843.8    1.17
21      3125      1.058
22      3437.5    0.941
23      3781.3    0.7497

Table 4.1: Mel filter boundaries and $\gamma(f)$.
Mel-spaced triangular filter boundaries and the weighting values used.
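
Using the values of Tab 4.1, the boundary warping of Eqn 4.2 can be sketched in Python
as below. The non-linear factor is taken here as
$\alpha(f) = 1 + (\gamma(f)/17)(\alpha - 1)$, which is an assumption of this example
about the garbled formula in the text; the function name is also hypothetical.

```python
import numpy as np

# Unwarped Mel-spaced filter boundaries (Hz) and gamma(f)/17 values from Tab 4.1.
BOUNDARIES_HZ = np.array([
    187.5, 312.5, 406.25, 500.0, 593.75, 687.5, 812.5, 906.25,
    1000.0, 1093.8, 1218.8, 1312.5, 1468.8, 1593.8, 1781.3, 1937.5,
    2156.3, 2343.8, 2593.8, 2843.8, 3125.0, 3437.5, 3781.3,
])
GAMMA_OVER_17 = np.array([
    0.5294, 0.5294, 0.5294, 0.5294, 0.5294, 0.588, 0.668, 0.748,
    0.828, 0.909, 0.989, 1.069, 1.149, 1.23, 1.31, 1.39,
    1.47, 1.41, 1.29, 1.17, 1.058, 0.941, 0.7497,
])


def warp_boundaries(alpha, nonlinear=True):
    """Divide each filter boundary by its (possibly frequency-dependent) factor."""
    if nonlinear:
        alpha_f = 1.0 + GAMMA_OVER_17 * (alpha - 1.0)   # assumed non-linear form
    else:
        alpha_f = np.full_like(BOUNDARIES_HZ, alpha)    # same factor everywhere
    return BOUNDARIES_HZ / alpha_f


linear_edges = warp_boundaries(1.12, nonlinear=False)
nonlinear_edges = warp_boundaries(1.12, nonlinear=True)
```

Only the filter edges change; the FFT of the speech frame is computed once and reused
for every candidate $\alpha$, which is what makes this implementation inexpensive.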







4.2.4 Experiments and Results 

This section presents an account of the experiments done to investigate the
effectiveness of the proposed speaker normalization procedure. Speech recognition
accuracy is used as a performance measure for both linear and non-linear scaling
functions for speaker normalization.

A Tasks and Databases 

Two telephone based continuous digit databases, one from the "30K Numbers Corpus" of
the Oregon Graduate Institute (OGI) and the other from AT&T, were used in these
experiments. The vocabulary consisted of eleven words: "one" to "nine", "zero" and
"oh". The training utterances were endpointed. All of the training dataset was hand
labelled with basic phoneme labels (ARPABET). The training set contained around 6640
utterances. The testing database was made up of two parts: one had utterances purely
from adults, while the second had utterances from children aged ten to seventeen. The
testing database was never exposed to the models during any part of the training. The
testing dataset had around 726 utterances with a total of 3517 words from adults, and
800 utterances with a total of 2834 words from children. The number of words in each
utterance varied from three to ten. After decoding, the number of substitution errors
(S), deletion errors (D) and insertion errors (I) can be calculated. Percent correct
and percent accuracy, defined as

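$$ \text{Percent correct} = \frac{N - D - S}{N} \times 100\% \qquad (4.3) $$

$$ \text{Percent accuracy} = \frac{N - D - S - I}{N} \times 100\% \qquad (4.4) $$

where $N$ is the total number of words in the reference transcriptions, are used to
evaluate the performance of various techniques.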
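
As a worked example of these two measures, the following Python sketch computes them
from error counts. The counts used are made up for illustration and are not results
from the thesis.

```python
def percent_correct(n, d, s):
    """(N - D - S) / N * 100; insertions are ignored."""
    return 100.0 * (n - d - s) / n


def percent_accuracy(n, d, s, i):
    """(N - D - S - I) / N * 100; insertions are penalized as well."""
    return 100.0 * (n - d - s - i) / n


# Illustrative counts: 100 reference words, 2 deletions, 3 substitutions, 1 insertion.
print(percent_correct(100, 2, 3))       # 95.0
print(percent_accuracy(100, 2, 3, 1))   # 94.0
```

Percent accuracy can never exceed percent correct, which is why each entry of the
results table has its first number larger than or equal to its second.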

B Baseline Speech Recognizer 

The experiments are done using a monophone HMM based speech recognition system
adapted from the RES [31] system. The peripheral modules, like Mel-filterbank
analysis, warping function implementation, training and testing modules, WOSA
analysis, database interface module (OGI-NIST format), label interface module
(ARPABET), feature file interface module, cepstral mean subtraction (CMS) module and
finally the complete two-pass strategy (warping) module, were added to the basic
skeletal recognizer from RES as a part of this work. All the modules were coded in
C++.

Testset               Adults           Children
Baseline              (93.54, 91.61)   (74.06, 70.89)
Linear                (95.28, 93.46)   (83.73, 80.95)
Proposed non-linear   (94.75, 93.00)   (81.23, 78.23)

Table 4.2: Recognition performance.
Recognition performance before and after scaling for speaker normalization, with the
scaling factor estimated in the ML sense. Performance is given in terms of
(% Correct, % Accuracy).

Each phoneme, including silence, was modelled by a continuous density left-to-right
HMM with three active states. The observation densities were mixtures of five
multivariate Gaussian distributions with diagonal covariance matrices.

All data were recorded over telephone sets and sampled at 8 kHz. Speech signals are
sectioned into frames of 20 ms (160 samples) with an overlap of 10 ms. Pre-emphasis by
a first-order backward difference and Hamming windowing are applied. For each data
frame a 256 point FFT is taken for Mel-filterbank analysis. Thirty-nine dimensional
feature vectors were used: normalized energy, c[1]-c[12] cepstra derived from a
mel-spaced filterbank of twenty-one filters, and their first and second order
differences.

C Speech Recognition Performance 

Tab 4.2 shows the recognition performance on the two test sets, one for adults and the
other for children. The results show that (1) linear frequency scaling provides a
substantial improvement over the baseline, and the improvement is more pronounced for
children; (2) contrary to the hypothesis, non-linear frequency scaling is not able to
improve upon linear scaling, which needs further investigation.






4.3 Summary 


A brief overview of HMM-based ASR was presented. The implementation of one such
recognizer for a digit recognition task was discussed. Methods to incorporate the
proposed non-linear scaling function from the previous chapters were presented. The
performance of the recognizer was compared with the baseline and analysed. The
failure of the non-linear scaling function to provide substantial improvements over
the recognizer with linear scaling can have many reasons. The proposed non-linear
scaling function is motivated by the deviations of the scaling behaviour from linear
scaling in vowels. This background is not identical to the conditions in which it is
used in a digit recognizer. Further investigation is needed to realize the advantages
of non-linear scaling, in terms of improvements in recognition performance, on
HMM-based recognizers with scale factor estimation.





Chapter 5 

Recognizers With Non-Linear 
Scale Invariance 


In Chapter 4 a recognizer with explicit scale factor estimation was explained. The
non-linear scaling function model derived in Chapter 2 (Eqn 2.6) could be incorporated
in a straightforward manner. The main disadvantage of this method is the need to
estimate one such scale factor for every utterance, which increases complexity and
introduces estimation errors. There is another class of recognizers which work on the
principle of applying a suitable scale-invariant transformation to the speech spectra
[4]. The basic idea is to warp a pair of mutually scaled spectra such that in the
warped domain they are shifted versions of each other. By taking the magnitude of the
Fourier transform of these shifted functions we get identical coefficients. This
method is explained in detail in the following sections. By suitably warping the
spectra even non-linear scaling can be taken care of [20]. Hence the proposed
non-linear scaling function is modified so that it can be incorporated in such a
non-linear scale invariant transformation.


5.1 Non-linear Scale Invariance on a Recognizer

This class of recognizers aims to reduce inter-speaker differences by suitably
transforming the smooth spectra. Here the transformations used are linear or non-linear
scale-invariant transformations [20]. If the linear scaling hypothesis is true, then the
spectra of two speakers are related by


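$$ S_A(\omega) = S_B(\alpha_{AB}\,\omega) $$

By exponentially sampling the spectra, i.e. substituting $\omega = e^{\nu}$,
they become shifted functions in the warped domain,

$$ S_A(\nu) = S_B(\nu + \ln \alpha_{AB}) $$

The magnitude of their Fourier transform leads to scale invariance,

$$ |\mathcal{F}(S_A(\nu))| = |\mathcal{F}(S_B(\nu + \ln \alpha_{AB}))| . $$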

Here exponential sampling is the warping appropriate for linear scaling of the frequency
axis, and it is realized as equi-spaced sampling in the log domain. Fig 5.1 shows the above
process pictorially: two linearly scaled signals become, after exponential sampling (log
warping), shifted versions of each other.
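
The shift property above can be checked numerically. The following is a minimal sketch, not
the implementation used in this work: the two-bump `toy_spectrum`, the 100 Hz to 8000 Hz
range, the 256-point grid and the scale factor are all hypothetical choices made only for
illustration. The scale factor is chosen so that the shift is a whole number of samples.

```python
import numpy as np

def toy_spectrum(f):
    """Hypothetical smooth spectrum with two formant-like bumps (illustration only)."""
    return (np.exp(-((f - 700.0) / 150.0) ** 2)
            + 0.6 * np.exp(-((f - 1800.0) / 300.0) ** 2))

n_points = 256                                   # assumed number of warped-domain samples
nu = np.linspace(np.log(100.0), np.log(8000.0), n_points)  # equi-spaced in log frequency
d_nu = nu[1] - nu[0]

shift_samples = 10                               # assumed shift, in samples
alpha_ab = np.exp(shift_samples * d_nu)          # scale factor giving exactly that shift

s_b = toy_spectrum(np.exp(nu))                   # S_B sampled exponentially
s_a = toy_spectrum(alpha_ab * np.exp(nu))        # S_A(f) = S_B(alpha_AB * f), same grid

# In the warped domain S_A is S_B shifted by ln(alpha_AB) / d_nu samples,
# so the magnitudes of the Fourier transforms agree (up to edge effects).
mag_a = np.abs(np.fft.rfft(s_a))
mag_b = np.abs(np.fft.rfft(s_b))
max_dev = float(np.max(np.abs(mag_a - mag_b)))
print("alpha_AB =", alpha_ab, " max |difference| =", max_dev)
```

The agreement is only approximate in general, because a finite grid turns the shift into a
circular one; here both spectra are close to zero at the ends of the grid, so the residual
is small.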

Non-linear warping is realized by sampling unequally in logarithmically equi-spaced
bands. Non-linearly scaled functions should turn out to be shifted versions
of each other in such a non-linearly warped domain.

The main advantages of the scale-invariant transformation are the reduced
computation compared to systems which explicitly estimate the scale factor,
and the ease of incorporating non-linearity in the scaling function.

5.1.1 Non-Linear Scale Invariant Transformation 

As noted earlier, exponential sampling of the smooth spectrum leads to linearly scaled
functions being shifted versions of each other. Exponential sampling is nothing but
equi-spaced sampling in the log domain. By changing the number of samples in such
bands we can implement the non-linear warping function. Given the model for
non-linearity of the scaling function, $\alpha(f \in B_i) = \alpha_{AB}^{\beta_i}$ (Eqn 3.3), we have the
warping function over $N$ logarithmically equi-spaced frequency bands as $W(f \in B_i) = [\ldots]$.
Here $\beta_i$ depends only on the frequency band $B_i$, but $\alpha_{AB}$ depends on the
pair of speakers considered. To linearly warp one speaker with respect to the other, we





Figure 5.1: Log warping of two scaled signals (panels: "Scaled signals" and "Log warped signals").

Two linearly scaled signals after exponential sampling become shifted versions of each
other. The magnitudes of the Fourier transforms of these signals are identical (see Sec 5.1).







need to compute $B(e^{\nu})$ for $e^{\nu} \in [L_i, U_i]$, where $U_i$ and $L_i$ are the upper and lower
frequency limits of the $i$-th frequency band. In the discrete implementation of the warping
function, $B(e^{\nu})$ is computed at $M_i$ equally spaced intervals in the region $\log(L_i)$ to $\log(U_i)$.
Now the problem is to find a suitable set of $M_i$'s for a given set of $\beta_i$'s (see [20] for
details). Let,


$$ \Delta\nu_i = \frac{\log(U_i) - \log(L_i)}{M_i} \tag{5.1} $$


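be the spacing in the $i$-th log-frequency band. Then, the exponentially spaced samples
in the $i$-th frequency region are $L_i e^{m_i \Delta\nu_i}$ for $m_i = 0, 1, 2, \ldots, (M_i - 1)$. Let the
scaling between two spectra be given by $S_A(f) = S_B(\alpha_{AB}^{\beta_i} f)$, where $\alpha_{AB}^{\beta_i}$ is the
scaling factor in the $i$-th frequency band. By exponentially sampling $A(f)$ (writing $A$ and
$B$ for $S_A$ and $S_B$) we have,

$$ A(L_i e^{m_i \Delta\nu_i}) = B(\alpha_{AB}^{\beta_i} L_i e^{m_i \Delta\nu_i}) = B(L_i e^{m_i \Delta\nu_i + \beta_i \log(\alpha_{AB})}), \quad m_i = 0, 1, \ldots, (M_i - 1) \tag{5.2} $$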


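Hence in the warped domain they form piece-wise shifted versions of each other,

$$ A[m_i] = B\left[m_i + \frac{\log(\alpha_{AB}^{\beta_i})}{\Delta\nu_i}\right] \tag{5.3} $$

with the shift in the $i$-th frequency band being

$$ \frac{\log(\alpha_{AB}^{\beta_i})}{\Delta\nu_i} = \frac{\beta_i \log(\alpha_{AB})}{\Delta\nu_i} \tag{5.4} $$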

The condition for the shift to be equal in all these $N$ logarithmically equi-spaced
regions is [20],

$$ \beta_i M_i = \beta_j M_j \quad \text{for } i, j = 0, 1, \ldots, (N - 1). \tag{5.5} $$

With the total number of samples held constant,

$$ \sum_{i=0}^{N-1} M_i = M_{const}, \tag{5.6} $$

the $M_i$'s are given by

$$ M_i = \frac{M_{const}}{\beta_i \sum_{j=0}^{N-1} (1/\beta_j)} \tag{5.7} $$
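
For completeness, the step from the equal-shift condition to Eqn 5.7 can be written out as
follows. This is a short worked sketch based only on Eqns 5.1 and 5.4 to 5.6; the constants
$D$ and $c$ are auxiliary symbols introduced here and do not appear elsewhere in the thesis.

```latex
\begin{align*}
\text{shift in band } i
  &= \frac{\beta_i \log(\alpha_{AB})}{\Delta\nu_i}
   = \frac{\beta_i M_i \log(\alpha_{AB})}{\log(U_i) - \log(L_i)}
   = \frac{\beta_i M_i \log(\alpha_{AB})}{D}
  && \text{(bands log-equispaced, width } D\text{)} \\
\text{equal shifts}
  &\;\Rightarrow\; \beta_i M_i = c \ \text{ for all } i
  \;\Rightarrow\; M_i = \frac{c}{\beta_i} \\
\sum_{i=0}^{N-1} M_i = M_{const}
  &\;\Rightarrow\; c \sum_{j=0}^{N-1} \frac{1}{\beta_j} = M_{const}
  \;\Rightarrow\; M_i = \frac{M_{const}}{\beta_i \sum_{j=0}^{N-1} (1/\beta_j)}
\end{align*}
```

Bands with a larger $\beta_i$ therefore receive fewer samples, so that a larger local scaling
still maps to the same shift in samples.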





Band (Hz)        Mi      Band (Hz)        Mi
[270, 376)        7      [1019, 1421)      8
[376, 524)       11      [1421, 1981)      6
[524, 731)        9      [1981, 2761)      7
[731, 1019)       8      [2761, 3850)      7

Table 5.1: Implementation of proposed eight band non-linear frequency warping.
Mi denotes the number of samples in logarithmically equi-spaced bands.


Band (Hz)        Mi      Band (Hz)        Mi
[270, 322)        3      [1114, 1330)      5
[322, 384)        4      [1330, 1587)      4
[384, 459)        5      [1587, 1895)      3
[459, 548)        8      [1895, 2262)      2
[548, 654)        5      [2262, 2701)      3
[654, 781)        4      [2701, 3225)      4
[781, 933)        4      [3225, 3850)      4
[933, 1114)       5

Table 5.2: Implementation of proposed fifteen band non-linear frequency warping.
Mi denotes the number of samples in logarithmically equi-spaced bands.


The $M_i$ values so obtained are to be suitably quantized to integer values, conforming to
the constraint given by Eqn 5.6. The $M_i$ values used for eight bands and fifteen
bands for the proposed non-linear scaling function are derived on similar lines and
are tabulated in Tab 5.1 and Tab 5.2.
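
The allocation and quantization steps can be sketched as below. This is only an illustrative
sketch under stated assumptions: the largest-remainder rounding rule is one possible way of
meeting the constraint of Eqn 5.6 and is not necessarily the rule used to produce the tables,
and `beta_demo` is a hypothetical set of exponents. The function names are likewise
introduced here for illustration.

```python
import numpy as np

def allocate_samples(beta, m_const):
    """Eqn 5.7, then largest-remainder rounding so that sum(M_i) == m_const (Eqn 5.6)."""
    beta = np.asarray(beta, dtype=float)
    ideal = m_const / (beta * np.sum(1.0 / beta))      # real-valued M_i from Eqn 5.7
    m = np.floor(ideal).astype(int)
    leftover = m_const - int(m.sum())
    # Assumed rounding rule: hand the remaining samples to the largest fractional parts.
    order = np.argsort(-(ideal - m), kind="stable")
    m[order[:leftover]] += 1
    return m

def warped_sample_frequencies(f_low, f_high, m):
    """M_i log-equispaced sample frequencies in each of len(m) log-equispaced bands."""
    edges = np.exp(np.linspace(np.log(f_low), np.log(f_high), len(m) + 1))
    freqs = []
    for lo, hi, mi in zip(edges[:-1], edges[1:], m):
        d_nu = (np.log(hi) - np.log(lo)) / mi           # Eqn 5.1
        freqs.append(lo * np.exp(d_nu * np.arange(mi))) # L_i * exp(m_i * d_nu_i)
    return edges, np.concatenate(freqs)

# Hypothetical exponents, for illustration only (not the values derived in the thesis).
beta_demo = [1.2, 0.8, 1.0, 1.1, 1.1, 1.4, 1.2, 1.2]
m_demo = allocate_samples(beta_demo, 63)

# Sample grid for the eight-band allocation listed in Tab 5.1.
m_table = np.array([7, 11, 9, 8, 8, 6, 7, 7])
edges, freqs = warped_sample_frequencies(270.0, 3850.0, m_table)
print(m_demo, int(m_demo.sum()), len(freqs))
```

With the range 270 Hz to 3850 Hz split into eight logarithmically equi-spaced bands, the
computed band edges, truncated to integers, reproduce the band limits of Tab 5.1.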


5.1.2 Experiments and Results 

The proposed non-linear scaling function is incorporated into the framework of the
non-linear scale-invariant transformation by deriving the $\beta_i$ set, and hence the $M_i$ set,
from Eqn 3.10 and Eqn 5.7. The effectiveness of the proposed warping function and of the
other warping functions (Eqn 3.12) is quantified in terms of speech recognition
performance for both the matched (adults) and mismatched (children) cases.

A Tasks and Databases 

The task and databases are identical to those used in the previous chapter (see
Tasks and Databases). The training set has around 11000 utterances, and the
test set has 846 utterances (3700 words) from adults and 800 utterances (2700
words) from children. Piecewise approximations to the proposed non-linear scaling
function with eight and fifteen bands are used for analysis.

B Baseline Speech Recognizer 

The experiments are done using a monophone HMM-based speech recognition system
adapted from HTK [32]. The warping function implementation and WOSA analysis
modules were coded in C and plugged into the basic recognizer.

Each phoneme, including silence, was modelled by a continuous-density
left-to-right HMM with four active states. The observation densities were (1) mixtures of
five multivariate Gaussian distributions with diagonal covariance matrices, or (2) a single
multivariate Gaussian distribution with a full covariance matrix.

For WOSA analysis, each data frame is further sectioned into subframes of
64 samples each, with an overlap of 45 samples, and Hamming windowing is applied.
The 127-point autocorrelation function so obtained is subjected to Fourier analysis to
obtain a smooth spectrum estimate. Thirty-nine dimensional feature vectors were
used: normalized energy, c[1]-c[12] cepstra derived from 64 samples equi-spaced in
the warped domain from WOSA analysis (see Feature Extraction, Sec 4.1), and their
first and second order differences.
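
The WOSA step described above can be sketched roughly as follows. This is a hedged
illustration, not the C module used in the experiments: the sampling rate of 8000 Hz, the
256-sample frame, the averaging of subframe autocorrelations, and the direct evaluation of
the transform on an arbitrary frequency grid are all assumptions made for the sketch. Only
the subframe length (64), the overlap (45), the Hamming window and the 127-point
autocorrelation come from the text.

```python
import numpy as np

def wosa_autocorrelation(frame, sub_len=64, overlap=45):
    """Average the autocorrelations of Hamming-windowed, overlapping subframes."""
    hop = sub_len - overlap
    window = np.hamming(sub_len)
    acf = np.zeros(2 * sub_len - 1)              # lags -(sub_len-1)..(sub_len-1): 127 points
    count = 0
    for start in range(0, len(frame) - sub_len + 1, hop):
        seg = frame[start:start + sub_len] * window
        acf += np.correlate(seg, seg, mode="full")
        count += 1
    if count == 0:
        raise ValueError("frame is shorter than one subframe")
    return acf / count

def smooth_spectrum(acf, freqs_hz, fs):
    """Fourier transform of the (symmetric) autocorrelation at the given frequencies."""
    half = len(acf) // 2
    lags = np.arange(-half, half + 1)
    omega = 2.0 * np.pi * np.asarray(freqs_hz, dtype=float) / fs
    # The autocorrelation is symmetric in the lag, so only cosine terms contribute.
    return np.cos(np.outer(omega, lags)) @ acf

fs = 8000.0                                      # assumed sampling rate
t = np.arange(256) / fs                          # assumed frame length of 256 samples
frame = np.sin(2.0 * np.pi * 1000.0 * t)         # test tone at 1000 Hz
acf = wosa_autocorrelation(frame)
grid = np.linspace(0.0, 4000.0, 257)             # any grid works, e.g. a warped one
spec = smooth_spectrum(acf, grid, fs)
print(len(acf), grid[np.argmax(spec)])
```

Evaluating the transform directly at chosen frequencies is convenient here because the
warped-domain samples are not uniformly spaced in frequency.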

C Speech Recognition Performance

Tab 5.3 shows the recognition performance on two test sets, one for adults and the
other for children. The results show that (1) log warping, or exponential sampling (linear
scaling), performs substantially better in the matched case, while Mel warping is more
advantageous in the mismatched case. (2) The proposed non-linear warping fails to provide





Testset                Adults                             Children
Covariance             Diagonal         Full              Diagonal         Full
Mel warp               (87.89, 85.86)   (92.96, 92.35)    (73.23, 70.36)   (81.32, 79.05)
Log warp               (89.11, 86.23)   (94.37, 93.40)    (72.08, 67.66)   (80.63, 76.72)
Eide et al             (86.81, 83.62)   (91.13, 89.61)    (67.55, 63.03)   (72.08, 68.74)
Umesh et al            (88.39, 86.36)   (93.35, 92.71)    (71.94, 68.42)   (79.81, 77.11)
Proposed non-linear
  with 8 bands         (88.50, 86.36)   (92.90, 92.05)    (70.86, 67.01)   (78.48, 75.96)
  with 15 bands        (88.11, 85.89)   (93.35, 92.57)    (71.79, 67.98)   (78.91, 76.39)


Table 5.3: Recognition performance of various non-linear scale invariant transformations.
The warping functions (Eqn 3.12) are used to compute the $M_i$'s in logarithmically
equi-spaced bands. Performance is given in terms of (% Correct, % Accuracy).


substantial improvement over the linear scaling performance on the recognizer. (3)
The warping function of Eide and Gish [21] degrades sharply in the mismatched case,
where it gives the lowest scores in Tab 5.3.
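
The (% Correct, % Accuracy) pairs of Tab 5.3 can be read with the following sketch in mind.
It assumes the usual HTK scoring convention, in which insertion errors are penalized only in
the accuracy figure; the error counts in the example are hypothetical and are not taken from
the experiments.

```python
def percent_correct(n_words, deletions, substitutions):
    """% Correct = (N - D - S) / N * 100; insertions are ignored."""
    return 100.0 * (n_words - deletions - substitutions) / n_words

def percent_accuracy(n_words, deletions, substitutions, insertions):
    """% Accuracy = (N - D - S - I) / N * 100; insertions are penalized as well."""
    return 100.0 * (n_words - deletions - substitutions - insertions) / n_words

# Hypothetical error counts on a 3700-word test set (illustration only).
n_words, deletions, substitutions, insertions = 3700, 50, 350, 100
corr = percent_correct(n_words, deletions, substitutions)
acc = percent_accuracy(n_words, deletions, substitutions, insertions)
print(round(corr, 2), round(acc, 2))
```

Under this convention the accuracy can never exceed the percentage correct, which is
consistent with every pair in the table.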


5.2 Summary 

The basic theory of scale-invariant transformations was presented. A method for
incorporating non-linear scaling in such a paradigm was also given. For the
proposed non-linear scaling function, one such scale-invariant transformation was
derived and implemented on an HMM-based digit recognizer. Similar transformations
were obtained for other non-linear scaling functions, which assume a similar model for
non-linearity. These transformations were evaluated side by side and the results were
tabulated. The results suggest that non-linear scaling needs more in-depth investigation to
be meaningfully implemented on an HMM-based recognizer.






Chapter 6 
Conclusions 


In this thesis we have attempted to exploit the additional information available
from classical speech analysis studies regarding the nature of scaling that exists
among speakers of different age and gender. This information is suitably combined
so that it can be readily incorporated into state-of-the-art recognizers. The
proposed scaling function has been analysed using different methods like formant
data analysis, spectral alignments and normalization schemes on an HMM-based
recognizer.

When the proposed non-linear scaling function is applied to a vowel formant
database, we obtain encouraging performance in terms of both F-ratio and residual
variance. The proposed vowel normalization procedure provides substantial
improvement over linear scaling in reducing the variance of vowel clusters. Spectral
alignment plots also show good alignment with the proposed normalization
procedure for the representative cases. Since these plots are subjective in nature, we lack
objective measures to comment on the effectiveness of the proposed procedure for
continuous spectra.

The proposed non-linear scaling function was incorporated into an HMM-based
recognizer. From the recognition performance results it can be inferred that,
even though linear scaling provides substantial improvements, the performance of
non-linear scaling is no better than that of linear scaling (in fact it slightly hurts
the performance). This may be due to many reasons, one of which may be
the coarse sampling of the frequency axis in filterbank analysis. The difference
in the filter boundaries between linear and non-linear scaling is marginal
due to this coarse sampling.

In order to overcome this problem, the spectral analysis method was changed
from the filterbank method to the WOSA method. By deriving a non-linear scale invariant
transformation and applying it on the WOSA-smooth spectrum, we achieve slightly
better results (percent accuracy is comparable to the linear scaling case).

In spite of the very promising performance of the proposed non-linear scaling
function in formant data analysis, its inability to provide improvements on an
HMM-based recognizer suggests that further investigation is needed into the suitability
of the present HMM-based recognizers for such normalization methods.

Future Work: 

As already noted, further investigation is needed to fully exploit
the advantages of the non-linear scaling function in an HMM-based recognizer.
The proposed scaling function was the result of combining the
information from earlier vowel analysis studies by Fant and others. More refined
methods of combining this information can lead to a better scaling
function. Though the physiological relevance of the derived warping function was only
touched upon, further detailed studies can be done on this aspect. With the
performance of the proposed non-uniform normalization procedure being promising for
vowel data, it may be worth building a vowel classifier incorporating the non-linear
scaling.





References 


[1] C. J. Leggetter and P. C. Woodland. Flexible Speaker Adaptation for Large 
Vocabulary Speech Recognition. In Proc. Eurospeech 95, pages 1155-1158, 
1995. 

[2] T. Anastasakos, J. McDonough, and J. Makhoul. Speaker Adaptive Training:
A Maximum Likelihood Approach to Speaker Normalization. In Proc. ICASSP-97,
pages 1043-1046, Munich, Germany, Apr. 1997.

[3] T. Anastasakos, F. Kubala, J. Makhoul, and R. Schwartz. Adaptation to New
Microphones Using Tied-Mixture Normalization. In Proc. IEEE Int. Conf.
Acoustics Speech and Signal Processing, pages 433-436, 1994.

[4] S. Umesh, L. Cohen, N. Marinovic, and D. Nelson. Scale Transform In Speech 
Analysis. IEEE Transactions on Speech and Audio Processing, January 1999. 

[5] T. Kamm, G. Andreou, and J. Cohen. Vocal Tract Normalization in Speech 
Recognition: Compensating for Systematic Speaker Variability. In Proc. of 
the 15th Annual Speech Research Symposium, pages 175-178, Johns Hopkins 
University, Baltimore, June 1995. 

[6] H. Wakita. Normalization of Vowels by Vocal-Tract Length and its Application
to Vowel Identification. IEEE Trans. Acoustic, Speech, Signal Processing,
ASSP-25(2):183-192, April 1977.

[7] G. Fant. Speech Sounds and Features. M.I.T. Press, Cambridge, MA, 1973. 

[8] P. E. Nordstrom and B. Lindblom. A Normalization Procedure for Vowel For- 
mant Data. In Int. Cong. Phonetic Sc., Leeds, Aug. 1975. 






[9] James L. Flanagan. Speech Analysis Synthesis and Perception. Springer-Verlag,
New York, 1972.

[10] John R. Deller, John G. Proakis, and John H. L. Hansen. Discrete-Time Pro- 
cessing of Speech Signals. Macmillan, New York, 1993. 

[11] Li Lee and Richard C. Rose. Speaker Normalization Using Efficient Frequency 
Warping Procedures. In Proc. IEEE ICASSP’96, pages 353-356, Atlanta, USA, 
May 1996. 

[12] G. Fant. A Non-Uniform Vowel Normalization. Technical report, Speech
Transmiss. Lab. Rep., Royal Inst. Tech., Stockholm, Sweden, 1975.

[13] James D. Miller. Auditory-Perceptual Interpretation of the Vowel. Journal of
Acoust. Soc. Am., 85(5):2114-2134, May 1989.

[14] A. K. Syrdal and H. S. Gopal. A Perceptual Model of Vowel Recognition
Based on the Auditory Representation of American English Vowels. Journal of
Acoust. Soc. Am., 79(4):1086-1100, Apr. 1986.

[15] S. Umesh, L. Cohen, and D. Nelson. Frequency-Warping And
Speaker-Normalization. In Proc. IEEE International Conference in Acoustics, Speech,
and Signal Proc., pages 983-986, Munich, Germany, April 1997.

[16] G. E. Peterson and H. L. Barney. Control Methods Used in a Study of the 
Vowels. J. Acoust. Soc. America, 24(2):175-194, March 1952. 

[17] Vinay M. K., S. Umesh, and Rohit Sinha. A Simple Procedure For Non Uniform 
Vowel Normalization. Submitted at TENCON-2001. 

[18] Keinosuke Fukunaga. Introduction to Statistical Pattern Recognition. Academic 
Press, San Diego, 1990. 

[19] Vinay M. K., S. Umesh, and Rohit Sinha. A Warping Function For Non 
Uniform Vowel Normalization. Submitted at SPCOM-2001. 

[20] S. Umesh, L. Cohen, N. Marinovic, and D. Nelson. Frequency-Warping in
Speech. In Proc. International Conference on Spoken Language Processing,
Philadelphia, USA, 1996.





[21] Ellen Eide and Herbert Gish. A Parametric Approach to Vocal Tract Length 
Normalization. In Proc. IEEE ICASSP’96, pages 346-349, Atlanta, USA, May 
1996. 

[22] Y. Ono, H. Wakita, and Y. Zhao. Speaker Normalization Using Constrained 
Spectra Shifts In Auditory Filter Domain. In Proc. Eurospeech-93, pages 117- 
124, Berlin, Germany, Sept. 1993. 

[23] S. B. Davis and P. Mermelstein. Comparison of Parametric Representations
for Monosyllabic Word Recognition in Continuously Spoken Sentences. IEEE
Trans. Acoustic, Speech, Signal Processing, ASSP-28:357-366, Aug. 1980.

[24] A. H. Nuttall and G. C. Carter. Spectral Estimation using Combined Time 
and Lag Weighting. Proceedings of the IEEE, 70:1115-1125, Sept. 1982. 

[25] N. Jayant and P. Noll. Digital Coding of Waveforms. Prentice Hall, 1984. 

[26] A. Acero and Richard Stern. Environmental Robustness in Automatic Speech
Recognition. In Proc. ICASSP-90, pages 849-852, Albuquerque, Apr. 1990.

[27] S. Furui. Speaker Independent Isolated Word Recognition Using Dynamic
Features of Speech Spectrum. IEEE Trans. Acoust., Speech, Signal Processing,
34(1):52-59, Feb. 1986.

[28] L. R. Rabiner, J. G. Wilpon, and F. K. Soong. High performance connected 
digit recognition using hidden markov models. In Proc. ICASSP 88, April 1988. 

[29] L. Rabiner and B. H. Juang. Fundamentals Of Speech Recognition. Prentice 
Hall, Englewood Cliffs, NJ, 1993. 

[30] L. E. Baum. An Inequality and Associated Maximization Technique in Statis- 
tical Estimation of Probabilistic Functions of Markov Processes. Inequalities, 
pages 1-8, 1972. 

[31] C. Becchetti and L. P. Ricotti. Speech Recognition, Theory and C++
Implementation. John Wiley & Sons, England, 1999.

[32] S. J. Young and P. C. Woodland. HTK version 1.5 User Reference &
Programmer Manual. Cambridge University Engg. Dept. Speech Group, 1993.





